Official Website: https://www.iceloof.com
About Me: https://hurin.iceloof.com
Hey 👋 If you like what I created, you can now buy me a coffee!
Script for GoogleNews
Home Page: https://pypi.org/project/GoogleNews/
License: MIT License
Currently the getpage method shows something like:
[{
'title': "Endgame: It's Amazing That No One Died In Thanos ...",
'media': 'Screen Rant',
'date': 'Dec. 2, 2019',
'desc': 'Avengers: Infinity War saw most of the superheroes in the MCU fight Thanos and his armies on Earth and outer space – but the Mad Titan won\xa0...',
'link': 'https://screenrant.com/avengers-endgame-thanos-attack-no-one-died-how/',
'img': 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQL__t6M0yeasih6JxReLAF1ilEcVstSQd3t-0zrGIL71-FNIeQsqCX9KiRdDuqIK17M_feDUX9&s'
}]
Is there any way to get the complete description with this package?
Thanks in advance!
I'm trying to get the news given a range, for example:
from GoogleNews import GoogleNews
googlenews = GoogleNews(start='02/01/2020',end='02/28/2020')
googlenews.get_news('trump')
for new in googlenews.results(sort=True):
    print(new['date'])
But my output is this
10 minutes ago
16 minutes ago
42 minutes ago
1 hour ago
1 hour ago
1 hour ago
1 hour ago
2 hours ago
6 hours ago
6 hours ago
7 hours ago
7 hours ago
7 hours ago
8 hours ago
9 hours ago
9 hours ago
10 hours ago
10 hours ago
12 hours ago
13 hours ago
13 hours ago
16 hours ago
17 hours ago
20 hours ago
23 hours ago
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
3 days ago
3 days ago
3 days ago
3 days ago
4 days ago
4 days ago
4 days ago
4 days ago
4 days ago
4 days ago
5 days ago
6 days ago
6 days ago
6 days ago
6 days ago
7 days ago
7 days ago
7 days ago
8 days ago
8 days ago
9 days ago
10 days ago
10 days ago
10 days ago
11 days ago
11 days ago
11 days ago
11 days ago
12 days ago
13 days ago
13 days ago
13 days ago
14 days ago
14 days ago
Oct 31
Oct 31
Oct 28
Oct 27
Oct 27
Oct 26
Oct 26
Oct 26
Oct 25
Oct 24
Oct 23
Oct 23
Oct 23
Oct 22
Oct 22
Oct 22
Oct 21
Oct 21
Oct 21
Oct 21
Oct 20
Oct 20
Oct 19
Oct 19
Oct 19
Oct 18
Oct 5
Sep 1
Aug 18
Aug 10
As you can see, this is completely different from what I expected. How can I fix it, or is it a bug?
Thanks
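For anyone who needs comparable dates out of labels like the ones listed above, here is a minimal sketch of a client-side converter. `parse_relative_date` is a hypothetical helper and not part of the GoogleNews package. It assumes English labels and guesses the year for labels such as 'Oct 31', which carry none.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helper, not part of the GoogleNews package.
# Maps the unit word found in a label to the matching timedelta keyword.
_UNITS = {"minute": "minutes", "min": "minutes", "hour": "hours",
          "day": "days", "week": "weeks"}


def parse_relative_date(label, now):
    """Convert '10 minutes ago', 'Yesterday' or 'Oct 31' to a datetime, else None."""
    text = label.strip().lower()
    if text == "yesterday":
        return now - timedelta(days=1)
    match = re.fullmatch(r"(\d+)\s+(minute|min|hour|day|week)s?\s+ago", text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now - timedelta(**{_UNITS[unit]: amount})
    # Labels like 'Oct 31' have no year: assume the most recent such date
    # that is not in the future relative to `now`.
    for year in (now.year, now.year - 1):
        try:
            parsed = datetime.strptime(f"{label.strip()} {year}", "%b %d %Y")
        except ValueError:
            continue
        if parsed <= now:
            return parsed
    return None
```

Passing `now` explicitly keeps the conversion reproducible; in a script it would typically be `datetime.now()`. This only normalizes the labels and does not explain why the requested range is ignored.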
Does this lib work? I get empty results all the time.
I installed this module using pip
Hi,
I have tried to use the package like this:
from GoogleNews import GoogleNews
googlenews = GoogleNews()
It gives this error ImportError: cannot import name 'GoogleNews'
PS: I am using Pycharm to load package!
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1124)>
This error was returned when using the sample code
Is there a way to get the entire title when searching? For example, I search "Biden's win" with Google News. The title of a result is "Alaska Sen. Dan Sullivan acknowledges President-elect Joe...", not "Alaska Sen. Dan Sullivan acknowledges President-elect Joe Biden’s win".
I get an empty list and I checked my network. Something is up
googlenew.search('МАЗ') does not return results.
However, encoding the string as CP1251 works, as per this SO answer: https://stackoverflow.com/questions/24234987/urlencode-cyrillic-characters-in-python
Maybe you should encode the search key before constructing the URL:
from urllib.parse import quote
# percent-encode the CP1251 bytes of the search key
__key = quote(__key.encode('cp1251'))
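To make the difference between the two encodings visible, here is a small standard-library sketch. `encode_search_key` is a hypothetical name used only for illustration. Which byte encoding the search endpoint accepts is not established here, so the encoding is left as a parameter.

```python
from urllib.parse import quote


def encode_search_key(key, encoding="utf-8"):
    """Percent-encode a search term using the given byte encoding.

    Hypothetical helper for illustration; not part of the GoogleNews package.
    """
    return quote(key.encode(encoding))


print(encode_search_key("МАЗ"))            # percent-encoded UTF-8 bytes
print(encode_search_key("МАЗ", "cp1251"))  # percent-encoded Windows-1251 bytes
```

The same three letters become six escaped bytes in UTF-8 and three in CP1251, so a server that expects one encoding will not recognize the other.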
I have been running this script on AWS EC2 in a virtual environment with Python 3.4, and it comes back empty; when I run it locally in PyCharm I get a result. Do you know why this is the case?
import requests
from bs4 import BeautifulSoup
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('forbes')
url = googlenews.getlinks()
name = googlenews.result()
print(url[0])
Hi there,
Thank you for creating a wonderful package for crawling Google News. It works well.
Just one question, is it possible to set the specific website for searching like we use it in Google as below?
site:www.bbc.com
I tried to set it within googlenews=GoogleNews() , just like I set the search language and start&end date, but it doesn't work.
It would be very useful to set specific website when we use this package.
Thank you very much!
returns 0 for all topics
I have a script that returns nothing, but my friend can scrape using the same script. I assume Google ip banned me? I only scraped like 200 article urls.
Thanks for the great work. I would like to be able to use Google's search techniques as described here.
I tried to achieve this as follows:
googlenews = GoogleNews(lang='en', encode='utf-8')
# googlenews.search('coronavirus') # with this, I saw some articles about Scotland. So I changed it to ->
googlenews.search('coronavirus -scotland')
news = googlenews.results()
But the news are still related to Scotland.
Is it possible to have this feature? It would be very good to have Google's search operators.
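Until it is clear whether search operators are passed through to Google unchanged, one workaround is to filter the returned list on the client side. The sketch below assumes each result is a dict with 'title' and 'desc' keys, as in the outputs quoted in this thread. `exclude_results` is a hypothetical helper, not a package function.

```python
def exclude_results(results, banned_words):
    """Drop result dicts whose title or description mentions any banned word.

    Hypothetical client-side filter; matching is a case-insensitive substring test.
    """
    banned = [word.lower() for word in banned_words]
    kept = []
    for item in results:
        # Missing or None fields are treated as empty text.
        text = f"{item.get('title') or ''} {item.get('desc') or ''}".lower()
        if not any(word in text for word in banned):
            kept.append(item)
    return kept
```

Usage would look like `news = exclude_results(googlenews.results(), ['scotland'])`. This only sees the truncated title and description, so an article that mentions the word further down would still get through.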
I am wondering how to set the date range. In the readme it only specifies googlenews.setperiod('d'). What does the 'd' do? How would I be able to specify, for instance, January 1st 2020 until January 2nd 2020? Thanks!
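As a sketch for building the range strings: the example earlier in this thread passes `start='02/01/2020'` and `end='02/28/2020'`, which suggests MM/DD/YYYY. That format is an assumption here, and `to_range_strings` is a hypothetical helper.

```python
from datetime import date


def to_range_strings(start, end):
    """Format two dates as MM/DD/YYYY strings (format assumed from the example above)."""
    if start > end:
        raise ValueError("start must not be after end")
    return start.strftime("%m/%d/%Y"), end.strftime("%m/%d/%Y")


# January 1st 2020 until January 2nd 2020
start_text, end_text = to_range_strings(date(2020, 1, 1), date(2020, 1, 2))
print(start_text, end_text)
```

The two strings could then be passed as `GoogleNews(start=start_text, end=end_text)`, mirroring the constructor call quoted above.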
Currently, when I get news in Japanese, datetime is set to None.
But when I change the language to en, datetime is set. Please fix it.
from GoogleNews import GoogleNews
import datetime
dt_now = datetime.datetime.now().date()
googlenews = GoogleNews(lang="ja", encode='utf-8')
googlenews.set_time_range(start=dt_now, end=dt_now)
googlenews.get_news("ビットコイン")
results = googlenews.results()
for result in results:
    print(result["datetime"])
Are the following options suitable for the period parameter? If not, are there other options rather than 'd'?
'h': past hour.
'd': past 24 hours.
'7d': past week.
'm': past month.
'1y': past year.
I got them on the google news website but when I use 7d or m I retrieve only news of the current day or more than 7 days/month old...
Is there a way to get the entire news article text from this or even an entire paragraph containing that particular keyword as currently we can get just 1 or 2 lines followed by '...' ??
For ex. 'Description': "Jio's parent Reliance Industries (RIL)\u2060, a conglomerate with businesses ranging from oil and petrochemicals to technology, retail and telecom\u2060, is less than a ..."
Is there a way to get full description???
Getting the following error
HTTP Error 429: Too Many Requests
Any clues to fix this?
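HTTP 429 means the server is rate limiting the client, so the usual remedy is to send fewer requests and wait between retries. Below is a minimal backoff sketch using only the standard library. It assumes the call raises `urllib.error.HTTPError`; if the package catches and prints the error itself, the wrapper would have to inspect the returned results instead. `call_with_backoff` is a hypothetical helper.

```python
import time
from urllib.error import HTTPError


def call_with_backoff(fetch, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait base_delay * 2**attempt seconds and retry.

    Hypothetical helper. Other HTTP errors, and a 429 on the last attempt,
    are re-raised to the caller.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except HTTPError as error:
            if error.code != 429 or attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.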
Hello, I have tried the most basic searches with no success:
googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()
print(len(result))
This returns 0.
Having this issue on v1.3.8
Any help would be appreciated, thank you.
I can't download articles because of this error. How do I bypass it? It seems to happen every time it goes through a specific URL. Basically, if it's not able to get through that website, I would like it to skip it and keep searching. Thank you for your help in the matter.
Hi HurinHu, great work! I was wondering if I can set the language of the Google News results I'm fetching. Thanks!
Hi,
Thanks for making this module. Unfortunately, I only get empty lists returned when trying to make searches. Any thoughts to what the issue might be? I've tried the basic example below:
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('APPL')
news = googlenews.result()
print(news)
Change to correct grammar: get__links()
I've tried this a couple of ways but I'm still not getting any search results... any idea what I'm doing wrong here? Code:
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('dinosaur')
links = [googlenews.getlinks()]
for link in links:
    print(link)
search() returning news for the query searched.
search() returns an empty list, no matter what I search for.
>>> from GoogleNews import GoogleNews
>>> api = GoogleNews()
>>> api.setlang('en')
>>> api.setperiod('d')
>>> api.setencode('utf-8')
>>> api.search('Bitcoin')
>>> api.result()
[]
And if I try to clear the result and search other stuff, I still get the same empty list:
>>> api.clear()
>>> api.search('Amazon')
>>> api.result()
[]
>>> api.clear()
>>> api.search('APPLE')
>>> api.result()
[]
But if I search for news using get_news(), I'm currently getting the news as usual:
>>> api.clear()
>>> api.get_news('Tesla')
>>> api.results()
[{'title': "Kelly Evans: We're all buying Tesla at the highs", 'desc': "My polite term for Tesla's valuation right now is “insane.” This is a company worth more than $550 billion. That's not only a staggering sum--making it the biggest ...", 'date': '30 days ago', 'datetime': datetime.datetime(2020, 12, 1, 20, 2, 27), 'link': 'news.google.com/./articles/CAIiEFYt9BkrzON5lyF3cHkseAsqGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_d7gU?hl=en-US&gl=US&ceid=US%3Aen', 'img': 'https://lh3.googleusercontent.com/g3nUNRA_3mBrV6g866OaK6tmy-ageNNgBKmnW86A3RVcBjxjd4-V-jZ6AdMrB15IVmJI0-50H8cWAiGLxNc=-p-df-h100-w100', 'media': None, 'site': 'CNBC'}
....]
I'm using GoogleNews from a virtualenv, this is what I have tried:
Working only with get_news() function
Working only with get_news() function
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
I would like to understand where the issue is happening and fix it on Python 3.6 and 3.8. It would be great if we could find a solution for this issue, which happens only with the search() function.
Over the last few days I observed the following.
April 11th: working all day, only doing small non consecutive searches
April 12th:
April 13th:
April 14th:
(noon): working, simple test
Are there some limits or guidelines we should follow while this project is under development?
partially initialized module 'GoogleNews' has no attribute 'get_news' (most likely due to a circular import)
cannot import name 'GoogleNews' from partially initialized module 'GoogleNews' (most likely due to a circular import)
from GoogleNews import GoogleNews
# initialize search and saving
googlenews = GoogleNews()
googlenews.setlang('en')
googlenews.setperiod('d')
googlenews.search("tesla")
full_results= googlenews.result()
for result in full_results:
    print(result["date"])
    print(result["link"])
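A common cause of the "partially initialized module" message is a local file or folder with the same name as the package (for example a script saved as GoogleNews.py), which Python imports instead of the installed package. Whether that applies to this report is not confirmed. The sketch below is a hypothetical check for that situation using only the standard library.

```python
from pathlib import Path


def shadows_package(script_path, package_name="GoogleNews"):
    """Return True if the script's folder holds a file or folder named like the package.

    Hypothetical diagnostic helper: a sibling 'GoogleNews.py' or 'GoogleNews/'
    directory would be found before the installed package when the script runs.
    """
    folder = Path(script_path).resolve().parent
    return (folder / f"{package_name}.py").exists() or (folder / package_name).is_dir()
```

Calling `shadows_package(__file__)` at the top of the failing script would show whether renaming a local file is worth trying.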
Hello, is it possible to search the news in my language (Thai)?
I've tried to search for news in Thai and it returned [ ].
Whenever I try to get data, it downloads only data in English. Here is the code:
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews = GoogleNews(lang = 'de')
googlenews = GoogleNews(start ='01/05/2019' ,end ='07/01/2019')
googlenews.search('Frankfurt')
result = googlenews.result()
for n in range(len(result)):
    print(n)
    for index in result[n]:
        print(index, '\n', result[n][index])
print(len(result))
and here is part of output:
0
title
Penalty heartbreak for Eintracht Frankfurt as Chelsea book ...
media
Deutsche Welle
date
9 may 2019
desc
The German side came so close to the Europa League final, but finally fell to Chelsea on penalties. Europa League - FC Chelsea v Eintracht Frankfurt | ...
link
https://www.dw.com/en/penalty-heartbreak-for-eintracht-frankfurt-as-chelsea-book-english-europa-league-final/a-48678619
img
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTpPc81Q_BdksRefghIs5VOt_w4cnPiye5fKYNd3KMkv-RywSnEgnsqncYLUXTychL9hH5N83s&s
1
title
Germany: Russian millionaire killed in Frankfurt plane crash
media
Deutsche Welle
date
31 mar 2019
desc
One of Russia's richest women, Natalia Fileva, has died in a plane crash near Frankfurt, Germany. The cause of the accident was not immediately clear.
link
https://www.dw.com/en/germany-russian-millionaire-killed-in-frankfurt-plane-crash/a-48138292
img
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR1H3RncCLG4oX1aNgI9Mljc8Diti3_ILA_Nki-4TAJIF1qa1rWChdlDHQbHFqJKoIU0iRe5qY8&s
I am using Ubuntu 18.04.4 LTS. I tried to change my system locale to LANG=de_DE.UTF-8. Unfortunately, it didn't help... Thank you in advance.
It would be great to search only the selected topic such as covid-19 and certain geographic regions like Africa
Topic Page: https://news.google.com/topics/CAAqIggKIhxDQkFTRHdvSkwyMHZNREZqY0hsNUVnSmxiaWdBUAE?hl=en-US&gl=US&ceid=US%3Aen
Geographic Boundary: https://news.google.com/topics/CAAqBwgKMJ25lwsw5uKuAw?hl=en-US&gl=US&ceid=US%3Aen
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews = GoogleNews(start='17/02/2021',end='22/02/2021')
news_keywords = ["Apple", "Microsoft"]  # etc.
check = ['17/02/2021','18/02/2021','19/02/2021','20/02/2021','21/02/2021','22/02/2021']
cols =['Date','Title','Description','provider','company']
lst = []
for i in news_keywords:
    print(i)
    googlenews.search(i)
    googlenews.get_page(1)
    googlenews.get_page(2)
    # etc.
currently returns
Apple
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
for multiple companies over this time period.
I have run the same code each week over the past months, so I'm unsure whether I'm doing something wrong with the latest update.
I used the API extensively for a few months. I came back to it today after a few weeks and it seems not to be returning any results. I could be blocked by some server on the way. Is the API still working for others?
Code used for test.
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('tesla')
if googlenews.result() == []:
    print("failed")
else:
    print("passed")
print(googlenews.get__links())
Add option to get direct links to articles instead of Google AMP links.
I did a monthly search and the number of returned results never surpasses 300. Is that a hard limit, or is there a workaround?
Instead of getting an exact date I get '1 month ago' in the results document. How can I fix that? Thank you for your help
from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk
# config will allow us to access the specified url for which we are
# not authorized. Sometimes we may get 403 client error while parsing
# the link to download the article.
nltk.download('punkt')
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 5
googlenews=GoogleNews(start='10/19/2020',end='10/19/2020')
googlenews.search('test')
result=googlenews.result()
df=pd.DataFrame(result)
print(df.head())
for i in range(2,5):
    googlenews.getpage(i)
    result=googlenews.result()
    df=pd.DataFrame(result)
list=[]
for ind in df.index:
    dict={}
    article = Article(df['link'][ind],config=config)
    try:
        article.download()
        article.parse()
        article2 = article.text.split()
    except:
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    # article.download()
    # article.parse()
    article.nlp()
    dict['Date']=df['date'][ind]
    dict['Media']=df['media'][ind]
    dict['Title']=article.title
    dict['Article']=article.text
    dict['Summary']=article.summary
    list.append(dict)
news_df=pd.DataFrame(list)
news_df.to_excel("articles.xlsx")
Hi,
Do I need to set the google API key?
I was trying it and it only returns null.
Let me know what other details you might need.
Hi,
I believe you're missing an import statement -> in your 'define_date' function you refer to np, but it isn't imported as far as I can see.
see -> return float(np.nan)
Best regards,
Mike
I encountered an issue with titles.
{'title': 'Crypto miners halt China business after Beijing cracks down ...',
'media': 'The Straits Times', 'date': '4 mins ago', 'datetime': None,
'desc': 'TOP, are suspending their China operations after Beijing stepped up its efforts to crack down on bitcoin mining and trading, sending the digital currency tumbling ...',
'link': 'https://www.straitstimes.com/business/companies-markets/bitcoin-volatility-puts-weekend-traders-on-stomach-churning-ride',
'img': 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=='}
As it can be seen in my provided code, the title is shortened. Can I know the reason and a way to "opt-out" of the shortening?
I'm looking forward to a function that gets the news within a specified date range, such as the news about covid-19 from 2020.03.01 to 2020.03.30. I think it will be more convenient for text analysis.
Thank you for this work! It is great.
Is there a way to get the full news in the description section or at least a paragraph instead of half sentences finishing with '...'????
Dear Author @HurinHu ,
Thanks for the package !
Fetching images is still a pain in the code. The images are stored as a javascript object in the self.content variable (shown in image). I tried extracting the value of variable but didn't succeed. Could you try ?
Hello,
You can see in my fork the algorithm I have used for retrieving the datetime from the news. It is based on this package. I have also included descending order for the news within the collected set.
You can see from a new test script how the algorithm handles rubbish from the HTML (usually non-parsable words ahead of the date).
PS: it can be improved further by removing copies of the same articles.
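The duplicate removal and descending order mentioned above can be sketched as follows. This is not the code from the fork, which is not shown here. It assumes each result is a dict with 'link' and 'datetime' keys, that 'datetime' is either None or a naive datetime, and that two results with the same link are copies. `dedupe_and_sort` is a hypothetical helper.

```python
from datetime import datetime


def dedupe_and_sort(results):
    """Keep the first result per link, then sort by 'datetime', newest first.

    Hypothetical helper. Results without a datetime are placed at the end.
    """
    seen = set()
    unique = []
    for item in results:
        link = item.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(item)
    # datetime.min stands in for missing values so they sort last in descending order.
    return sorted(unique, key=lambda item: item.get("datetime") or datetime.min,
                  reverse=True)
```

Comparing by link is the simplest choice; syndicated copies of one article on different sites would need a comparison on title or text instead.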
Hi,
I'm trying to web search multiple terms using GoogleNews. My script used to work on older versions of GoogleNews but it no longer works. It only searches the first term multiple times, and gives me repeating results.
Any help is greatly appreciated. I need this fix by Wednesday for work.
from GoogleNews import GoogleNews
import sys
f = open("googlenews22.txt", "w")
keywordlist = ['apple', 'samsung', 'nokia']
googlenews = GoogleNews()
for word in keywordlist:
    googlenews.search(word)
    googlenews.setTimeRange('15/05/2020','15/06/2020')
    googlenews.getpage(1)
    results = googlenews.result()
    listofres = []
    for ting in word:
        title = ting['title']
        date = ting['date']
        link = ting['link']
        listofres += [[title, date, link]]
        f.write("%s, %s, %s \n" %(title, date, link))
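Two things in the script above stand out. The inner loop iterates over `word`, which is the keyword string, so `ting` is a single character rather than a result dict. In addition, the same object is reused for every keyword without calling `clear()`, a method used elsewhere in this thread. The sketch below assumes that results accumulate on the object until `clear()` is called, which is not verified here. `collect_per_keyword` is a hypothetical helper that accepts any object with `clear()`, `search()` and `result()` methods.

```python
def collect_per_keyword(client, keywords):
    """Return {keyword: [(title, date, link), ...]} using one client object.

    Hypothetical helper; `client` is expected to behave like a GoogleNews
    instance with clear(), search() and result() methods.
    """
    collected = {}
    for word in keywords:
        client.clear()       # drop results left over from the previous keyword
        client.search(word)
        rows = []
        for item in client.result():   # iterate over results, not the keyword
            rows.append((item["title"], item["date"], item["link"]))
        collected[word] = rows
    return collected
```

With the real package this would be called as `collect_per_keyword(GoogleNews(), ['apple', 'samsung', 'nokia'])`, and each row could then be written to the output file.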
Get customized article feed if logged in
Hello author. I want to report this strange bug.
I tested locally with your library and it still works normally. The "entries" part still exists:
https://pastebin.com/xYs8pmkv
..........
'entries': [
{
'title': 'Angela Merkel tiêm vaccine: Liều một AstraZeneca, liều hai Moderna - BBC Tiếng Việt',
'title_detail': {
'type': 'text/plain',
'language': None,
'base': '',
'value': 'Angela Merkel tiêm vaccine: Liều một AstraZeneca, liều hai Moderna - BBC Tiếng Việt'
},
......
But when I deploy it on https://railway.app/, the result somehow contains an empty "entries" list.
The log: https://pastebin.com/Sj5ARuVN
....
'entries': [
// empty
]
The code I'm using:
gn = GoogleNews(lang = 'vi')
search = gn.search('Covid-19', when = '1h')
print(search)
entries = search['entries']
print(entries)