Microsoft News Dataset
About 50,000 URLs return 503 errors, so the crawler fails to get the article body. I have tried many times from different servers, and I am afraid those pages are no longer available. Is there any other way to access these pages?
Thanks :)
Part of the log is as follows:
2020-10-22 07:25:22 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAJXRIf.html> (referer: None)
2020-10-22 07:25:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJhftY.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJSZ1k.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAIVDlM.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJXRIf.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIDhqf.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAJ2VwG.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAJ2VwG.html> (referer: None)
2020-10-22 07:25:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJ74bz.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJ2VwG.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAIJD0a.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAIJD0a.html> (referer: None)
2020-10-22 07:25:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAIJD0a.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAJ74Fs.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAJ74Fs.html> (referer: None)
2020-10-22 07:25:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIP2NI.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:24 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAIJDGD.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:24 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAIJDGD.html> (referer: None)
2020-10-22 07:25:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJBRTs.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:24 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJ74Fs.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:24 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAIJDGD.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJmVXT.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIJEks.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAJSZ5I.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAJSZ5I.html> (referer: None)
2020-10-22 07:25:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJSZ5I.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJyI9B.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJmVYC.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJXRX7.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAJyI1B.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAJyI1B.html> (referer: None)
2020-10-22 07:25:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIVDyW.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJyI1B.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJhgac.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIVE03.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJtewR.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:26 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://assets.msn.com/labs/mind/AAJ74aR.html> (failed 3 times): 503 Service Unavailable
2020-10-22 07:25:26 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://assets.msn.com/labs/mind/AAJ74aR.html> (referer: None)
2020-10-22 07:25:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIP2Ub.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://assets.msn.com/labs/mind/AAJ74aR.html>: HTTP status code is not handled or not allowed
2020-10-22 07:25:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJb3NF.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJmVfa.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJXRcB.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJtf1W.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJmVgq.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJBRoU.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJ2WTB.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJyIJ6.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJMhce.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJ2WWw.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIP2cL.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:29 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIDhqf.html> (failed 2 times): 503 Service Unavailable
2020-10-22 07:25:29 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIZK72.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:29 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJ74bz.html> (failed 2 times): 503 Service Unavailable
2020-10-22 07:25:29 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJMhdb.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIJPvD.html> (failed 1 times): 503 Service Unavailable
2020-10-22 07:25:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAIP2NI.html> (failed 2 times): 503 Service Unavailable
2020-10-22 07:25:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://assets.msn.com/labs/mind/AAJBRTs.html> (failed 2 times): 503 Service Unavailable
Hi, I want to obtain news that is not in the MIND dataset, for data augmentation. What is the correct way to get a large number of news URLs?
This is Ying from MS News :)
Can you share the test dataset?
Or can I make a late submission to this competition? https://competitions.codalab.org/competitions/24122
Thanks.
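First of all, thank you for your great contributions to research.
I would like to ask how I can extract the ground truth from behaviors.tsv and use it to verify the evaluation metrics of a recommendation algorithm.
My code is based on:
https://github.com/msnews/MIND/blob/master/evaluate.py
Judging from the code, there should be a file named truth.txt, but I do not know how to obtain it.
If there is a better approach, I would also appreciate your guidance.
Thank you very much for your help!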
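A possible way to build such a file for the training or validation split is sketched below. It rests on two assumptions that are not confirmed in this thread: that each row of behaviors.tsv has five tab-separated columns (impression ID, user ID, time, history, impressions) with impressions written as `NewsID-label` pairs, and that the truth file holds one `<impression id> <JSON label list>` entry per line. The function name `truth_lines` is hypothetical. Rows without labels, as in the test split, are not handled.

```python
import json


def truth_lines(behavior_lines):
    """Yield '<impression id> <JSON label list>' for each labeled behaviors.tsv row."""
    for line in behavior_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed column order: impression ID, user ID, time, history, impressions.
        impression_id, _user, _time, _history, impressions = line.split("\t")
        # Each impression item is assumed to look like "N55689-1" (news ID, click label).
        labels = [int(item.rsplit("-", 1)[1]) for item in impressions.split()]
        yield f"{impression_id} {json.dumps(labels, separators=(',', ':'))}"


# Made-up row in the assumed format, for illustration only.
sample_rows = ["1\tU13740\t11/11/2019 9:05:58 AM\tN55189 N42782\tN55689-1 N35729-0\n"]
truth = list(truth_lines(sample_rows))
```

Writing the yielded lines to truth.txt would give a file in the assumed format; check it against what evaluate.py actually parses before relying on it.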
Considering the full training data, there are 101526 news records in the file news.tsv, but only 98387 unique titles.
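The count can be reproduced with a short script like the sketch below. It assumes the title is the fourth tab-separated column of news.tsv, and the helper name `title_counts` is hypothetical.

```python
import csv
from collections import Counter


def title_counts(news_lines):
    """Count how many news records share each title (assumed to be column index 3)."""
    reader = csv.reader(news_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[3] for row in reader if len(row) > 3)


# Made-up rows for illustration; real rows have more columns after the title.
sample_news = [
    "N1\tnews\tnewsus\tSame title\tAbstract A",
    "N2\tnews\tnewsus\tSame title\tAbstract B",
    "N3\tsports\tnfl\tAnother title\tAbstract C",
]
counts = title_counts(sample_news)
duplicated = {title: n for title, n in counts.items() if n > 1}
```

Here `sum(counts.values())` is the number of records and `len(counts)` the number of unique titles, which are the two figures compared above.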
Hi, thanks for your crawler script.
I ran your code,
but I have a problem with this line:
"with open(os.environ["MIND_NEWS_PATH"], 'r') as f:"
The error message is:
"raise KeyError(key) from None
KeyError: 'MIND_NEWS_PATH'"
Is it caused by the path?
Do I need the "MIND_NEWS_PATH" path from your computer?
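In Python, `os.environ["MIND_NEWS_PATH"]` raises `KeyError` whenever that environment variable is not set in the process running the script, so the value has to come from your own machine. The sketch below shows one way to read it with a clearer error. It assumes the variable should point to your local copy of news.tsv, which the quoted line suggests but this thread does not confirm; the function name `resolve_news_path` and the example path are hypothetical.

```python
import os


def resolve_news_path(environ=os.environ):
    """Return the path stored in MIND_NEWS_PATH, or fail with a readable message."""
    path = environ.get("MIND_NEWS_PATH")
    if not path:
        raise RuntimeError(
            "MIND_NEWS_PATH is not set. Point it at your own news.tsv, for example: "
            "export MIND_NEWS_PATH=/path/to/news.tsv"
        )
    return path


# Usage with an explicit mapping (hypothetical path), instead of the real environment:
example_path = resolve_news_path({"MIND_NEWS_PATH": "/data/MINDsmall_train/news.tsv"})
```

Setting the variable in the shell before starting the crawler (`export` on Linux or macOS, `set` on Windows cmd) should have the same effect as passing the mapping here.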
I get the following error while trying to create the conda environment using the yaml file.
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
Could you share the ground-truth labels of the impressions in the test set, so that we can run the evaluation ourselves?
In the MIND-small dataset (downloaded from https://msnews.github.io/), it seems that at least the following entities from news.tsv are missing from entity_embedding.vec (in both the training set and the validation set).
['Q1088533',
'Q1101080',
'Q12618419',
'Q1268659',
'Q1274953',
'Q13479957',
'Q1369657',
'Q1433118',
'Q14641401',
'Q1518821',
'Q15957025',
'Q1618101',
'Q16941365',
'Q16971395',
'Q16974590',
'Q17005981',
'Q17075762',
'Q17157233',
'Q1759320',
'Q18206822',
'Q1830767',
'Q1974060',
'Q1985787',
'Q19876954',
'Q2213191',
'Q22909405',
'Q22953685',
'Q23016498',
'Q24963455',
'Q25098896',
'Q2914964',
'Q2920489',
'Q3115816',
'Q3309248',
'Q3560381',
'Q3642692',
'Q3988',
'Q43078954',
'Q47195',
'Q4870032',
'Q48834885',
'Q5161167',
'Q5223107',
'Q5255734',
'Q5433004',
'Q5504038',
'Q55604490',
'Q5571889',
'Q55956059',
'Q56248373',
'Q56276162',
'Q56315722',
'Q5643122',
'Q5956375',
'Q59608476',
'Q60745527',
'Q60756198',
'Q60767275',
'Q6148929',
'Q6322233',
'Q6774029',
'Q6815811',
'Q6841333',
'Q6972751',
'Q6988097',
'Q7061735',
'Q7090844',
'Q7098673',
'Q7158944',
'Q7233544',
'Q724143',
'Q7246726',
'Q7250831',
'Q7305326',
'Q7316299',
'Q7381085',
'Q7389710',
'Q7732041',
'Q7784916',
'Q7826008',
'Q7845560',
'Q7883884',
'Q8023958',
'Q8026356']
Take Q1088533 for example: I opened the four files (news.tsv and entity_embedding.vec in the training and validation sets) and did a global search, but it was not found in entity_embedding.vec.
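A check like this can be automated instead of searching by hand. The sketch below is not the script used to produce the list above. It assumes that columns 7 and 8 of news.tsv hold JSON lists of entity objects with a `WikidataId` key, and that each line of entity_embedding.vec starts with the entity ID followed by whitespace-separated values. Both function names are hypothetical.

```python
import json


def entity_ids_in_news(news_lines):
    """Collect the WikidataId values from the two entity columns of news.tsv."""
    ids = set()
    for line in news_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: index 6 = title entities, index 7 = abstract entities.
        for column in fields[6:8]:
            if column:
                ids.update(entity["WikidataId"] for entity in json.loads(column))
    return ids


def entity_ids_in_vec(vec_lines):
    """Collect the entity IDs that have an embedding (first token of each line)."""
    return {line.split()[0] for line in vec_lines if line.strip()}


# Made-up rows in the assumed formats, for illustration only.
news_sample = [
    'N1\tsports\tnfl\tTitle\tAbstract\thttps://example.invalid/N1.html'
    '\t[{"WikidataId": "Q1088533"}]\t[{"WikidataId": "Q42"}]'
]
vec_sample = ["Q42\t0.1\t0.2\n"]
missing = sorted(entity_ids_in_news(news_sample) - entity_ids_in_vec(vec_sample))
```

Passing open file handles for the real news.tsv and entity_embedding.vec in place of the sample lists would list every entity ID without an embedding.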