Contributors

baconian, c0008, cyanide2489563, david-xhf, dmifer, dmil, emremrah, hugocool, hurinhu, la-strole, mcilento93, mjlabe, prjvvl, rbshadow, samwesley, soulrein, wastu01

googlenews's Issues

How to get complete article description (instead of first few sentences)

Currently the getpage method shows something like:

[{
'title': "Endgame: It's Amazing That No One Died In Thanos ...", 
'media': 'Screen Rant', 
'date': 'Dec. 2, 2019', 
'desc': 'Avengers: Infinity War saw most of the superheroes in the MCU fight Thanos and his armies on Earth and outer space – but the Mad Titan won\xa0...',
'link': 'https://screenrant.com/avengers-endgame-thanos-attack-no-one-died-how/',
'img': 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQL__t6M0yeasih6JxReLAF1ilEcVstSQd3t-0zrGIL71-FNIeQsqCX9KiRdDuqIK17M_feDUX9&s'
}]

Is there any way to get the complete description with this package?

Thanks in advance!

Dates don't work as expected

I'm trying to get news within a date range, for example:

from GoogleNews import GoogleNews
googlenews = GoogleNews(start='02/01/2020',end='02/28/2020') 
googlenews.get_news('trump')
for new in googlenews.results(sort=True):
    print(new['date'])

But my output is this:

10 minutes ago
16 minutes ago
42 minutes ago
1 hour ago
1 hour ago
1 hour ago
1 hour ago
2 hours ago
6 hours ago
6 hours ago
7 hours ago
7 hours ago
7 hours ago
8 hours ago
9 hours ago
9 hours ago
10 hours ago
10 hours ago
12 hours ago
13 hours ago
13 hours ago
16 hours ago
17 hours ago
20 hours ago
23 hours ago
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
Yesterday
3 days ago
3 days ago
3 days ago
3 days ago
4 days ago
4 days ago
4 days ago
4 days ago
4 days ago
4 days ago
5 days ago
6 days ago
6 days ago
6 days ago
6 days ago
7 days ago
7 days ago
7 days ago
8 days ago
8 days ago
9 days ago
10 days ago
10 days ago
10 days ago
11 days ago
11 days ago
11 days ago
11 days ago
12 days ago
13 days ago
13 days ago
13 days ago
14 days ago
14 days ago
Oct 31
Oct 31
Oct 28
Oct 27
Oct 27
Oct 26
Oct 26
Oct 26
Oct 25
Oct 24
Oct 23
Oct 23
Oct 23
Oct 22
Oct 22
Oct 22
Oct 21
Oct 21
Oct 21
Oct 21
Oct 20
Oct 20
Oct 19
Oct 19
Oct 19
Oct 18
Oct 5
Sep 1
Aug 18
Aug 10

As you can see, this is completely different from what I expected. How can I fix it, or is it a bug?
Thanks
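
The output above mixes relative strings ('10 minutes ago', 'Yesterday') with bare month-day strings ('Oct 31'). As a minimal sketch, those could be normalised to datetime values on the caller's side. `parse_relative_date` is a hypothetical helper, not part of GoogleNews. It assumes English strings and that a month-day string without a year refers to the most recent such date.

```python
import re
from datetime import datetime, timedelta

# Maps the unit word found in the string to a timedelta keyword argument.
_UNITS = {"min": "minutes", "minute": "minutes", "hour": "hours",
          "day": "days", "week": "weeks"}


def parse_relative_date(text, now):
    """Best-effort conversion of strings such as '10 minutes ago' or 'Oct 31'.

    Returns a datetime, or None when the string is not recognised.
    """
    text = text.strip()
    if text.lower() == "yesterday":
        return now - timedelta(days=1)

    match = re.fullmatch(r"(\d+)\s+(min|minute|hour|day|week)s?\s+ago",
                         text, re.IGNORECASE)
    if match:
        amount = int(match.group(1))
        unit = _UNITS[match.group(2).lower()]
        return now - timedelta(**{unit: amount})

    # 'Oct 31' carries no year: try the current year first, then the previous
    # one, and keep the first candidate that is not in the future.
    for year in (now.year, now.year - 1):
        try:
            parsed = datetime.strptime(f"{text} {year}", "%b %d %Y")
        except ValueError:
            continue
        if parsed <= now:
            return parsed
    return None
```

Passing `now` explicitly keeps the helper deterministic. Formats such as 'Dec. 2, 2019' are not handled and return None.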

ImportError: cannot import name 'GoogleNews'

Hi,
I have tried to use the package like this:
from GoogleNews import GoogleNews
googlenews = GoogleNews()
It gives this error: ImportError: cannot import name 'GoogleNews'
PS: I am using PyCharm to load the package.

How to get the full article title?

Is there a way to get the entire title in a search? For example: I search "Biden's win" with Google News. The title of a result is "Alaska Sen. Dan Sullivan acknowledges President-elect Joe...", and not "Alaska Sen. Dan Sullivan acknowledges President-elect Joe Biden’s win".

GoogleNews on EC2

I have been running this script on AWS EC2 in a virtual environment with Python 3.4, and it comes back empty. When I run it locally in PyCharm I get a result. Do you know why this is the case?

import requests
from bs4 import BeautifulSoup
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('forbes')
url = googlenews.getlinks()
name = googlenews.result()
print(url[0])

Is it possible to set the specific website?

Hi there,

Thank you for creating a wonderful package for crawling Google News. It works well.

Just one question: is it possible to restrict the search to a specific website, as we do in Google with the operator below?

site:www.bbc.com

I tried to set it within googlenews=GoogleNews(), just as I set the search language and the start and end dates, but it doesn't work.

It would be very useful to be able to set a specific website when using this package.

Thank you very much!
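
One option to try is to put the operator into the query string itself rather than into the constructor. The sketch below only builds such a string. `build_query` is a hypothetical helper, and whether GoogleNews forwards operators like `site:` to Google unchanged is an assumption that this thread does not confirm.

```python
def build_query(term, site=None, exclude=()):
    """Compose a search string using Google-style operators."""
    parts = [term]
    if site:
        # Restrict results to one website, e.g. site:www.bbc.com
        parts.append(f"site:{site}")
    # A leading minus asks the search engine to exclude a word.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


query = build_query("coronavirus", site="www.bbc.com")
print(query)
```

The resulting string would then be passed to the search call in place of the plain keyword.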

Blacklisted by Google News?

I have a script that returns nothing, but my friend can scrape using the same script. I assume Google banned my IP? I only scraped about 200 article URLs.

Searching a query with Google's search techniques

Thanks for the great work. I would like to be able to use Google's search techniques as described here.

I tried to achieve this as follows:

googlenews = GoogleNews(lang='en', encode='utf-8')
# googlenews.search('coronavirus')  # with this, I saw some articles about Scotland. So I changed it to ->
googlenews.search('coronavirus -scotland')
news = googlenews.results()

But the results still include news about Scotland.

Is it possible to have this feature? It would be very good to have Google's search operators.
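
Until operators are honoured by the search itself, a workaround is to filter the returned list on the client side. This is a minimal sketch, assuming each result is a dict with 'title' and 'desc' keys as in the outputs shown elsewhere on this page. `exclude_terms` is a hypothetical helper, not part of the package.

```python
def exclude_terms(results, terms, fields=("title", "desc")):
    """Drop results whose title or description mentions any excluded term."""
    lowered = [term.lower() for term in terms]
    kept = []
    for item in results:
        # Missing or None fields are treated as empty text.
        text = " ".join(str(item.get(field) or "") for field in fields).lower()
        if not any(term in text for term in lowered):
            kept.append(item)
    return kept
```

For example, `exclude_terms(news, ["scotland"])` would keep only entries that do not mention Scotland in those two fields. It cannot see text that Google truncated out of the description.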

How to set date range

I am wondering how to set the date range. In the readme it only specifies googlenews.setperiod('d'). What does the 'd' do? How would I be able to specify, for instance, January 1st 2020 until January 2nd 2020? Thanks!

I want to get datetime for Japanese news.

Currently, when I get news in Japanese, datetime is set to None.
But when I change the language to en, datetime is set. Please fix it.

from GoogleNews import GoogleNews
import datetime

dt_now = datetime.datetime.now().date()


googlenews = GoogleNews(lang="ja", encode='utf-8')
googlenews.set_time_range(start=dt_now, end=dt_now)
googlenews.get_news("ビットコイン")
results = googlenews.results()


for result in results:
    print(result["datetime"])

Options for 'period' parameter

Are the following options valid for the period parameter? If not, are there options other than 'd'?
'h': past hour.
'd': past 24 hours.
'7d': past week.
'm': past month.
'1y': past year.

I got them from the Google News website, but when I use '7d' or 'm' I retrieve only news from the current day, or news more than 7 days or a month old...

How to get full news article

Is there a way to get the entire news article text, or at least the entire paragraph containing the keyword? Currently we get just one or two lines followed by '...'.
For example, 'Description': "Jio's parent Reliance Industries (RIL)\u2060, a conglomerate with businesses ranging from oil and petrochemicals to technology, retail and telecom\u2060, is less than a ..."

Is there a way to get the full description?

HTTP 429 error

Getting the following error

HTTP Error 429: Too Many Requests

Any clues to fix this?
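
A 429 response means the server is rate limiting the client, so the usual remedy is to slow down and retry later. The following is a minimal sketch of retrying with exponential backoff. `call_with_backoff` is a hypothetical helper, and it assumes the error surfaces as `urllib.error.HTTPError`, which is what the message format above suggests. The delays are illustrative, not values recommended by Google or by the package.

```python
import time
import urllib.error


def call_with_backoff(func, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on HTTP 429 with exponentially growing pauses."""
    for attempt in range(retries + 1):
        try:
            return func()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not rate limiting, and give up
            # once the retry budget is spent.
            if err.code != 429 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The search call would be wrapped in a function or lambda and passed as `func`.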

No Results

Hello, I have tried the most basic searches with no success:

googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()
print(len(result))

This returns 0.
Having this issue on v1.3.8

Any help would be appreciated, thank you.

Article `download()` failed with 403 Client Error: Forbidden for url:

I can't download articles because of this error. How do I bypass it? It seems to happen every time the script reaches a specific URL. Basically, if it's not able to get through to that website, I would like it to skip that URL and keep going. Thank you for your help in the matter.
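
The skip-and-continue behaviour asked for here can be sketched independently of any download library. `download_all` and its `fetch` argument are hypothetical names: `fetch` stands for whatever function downloads one article and raises an exception on failure.

```python
def download_all(urls, fetch):
    """Fetch every URL, skipping the ones that fail instead of stopping."""
    texts = {}
    failed = []
    for url in urls:
        try:
            texts[url] = fetch(url)
        except Exception as err:
            # Remember the failure and move on to the next URL.
            failed.append((url, str(err)))
    return texts, failed
```

Keeping the failures in a list makes it possible to inspect which sites refuse the request. This does not bypass the 403 itself, which is the site's decision.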

Language settings

Hi HurinHu, great work! I was wondering if I can set the language of the Google News results I'm fetching. Thanks!

Empty list always returned

Hi,

Thanks for making this module. Unfortunately, I only get empty lists returned when trying to make searches. Any thoughts to what the issue might be? I've tried the basic example below:

from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('APPL')
news = googlenews.result()
print(news)

0 Search Results

I've tried this a couple of ways but I'm still not getting any search results... any idea what I'm doing wrong here? Code:

from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('dinosaur')
links = [googlenews.getlinks()]
for link in links:
    print(link)

Search Return Empty List, no Matter What I Search

Expected Behavior

  • Google News search() returning news for the query searched.

Current Behavior

  • GoogleNews search() returns an empty list, no matter what I search for.

Possible Solution

  • I thought it might be happening because of the server I'm using, which is hosted in North Carolina. I have already tried the code on my local machine and on a server in Europe, and it worked fine, so I'm starting to think the issue is somehow related to the server.

Steps to Reproduce

On my local machine this doesn't happen, I'm facing this issue in a US server.

>>> from GoogleNews import GoogleNews
>>> api = GoogleNews()
>>> api.setlang('en')
>>> api.setperiod('d')
>>> api.setencode('utf-8')
>>> api.search('Bitcoin')
>>> api.result()
[]

And if I try to clear the result and search other stuff, I still get the same empty list:

>>> api.clear()
>>> api.search('Amazon')
>>> api.result()
[]
>>> api.clear()
>>> api.search('APPLE')
>>> api.result()
[]

But if I search for news using get_news(), I'm currently getting the news as usual:

>>> api.clear()
>>> api.get_news('Tesla')
>>> api.results()
[{'title': "Kelly Evans: We're all buying Tesla at the highs",
  'desc': "My polite term for Tesla's valuation right now is “insane.” This is a company worth more than $550 billion. That's not only a staggering sum--making it the biggest ...",
  'date': '30 days ago',
  'datetime': datetime.datetime(2020, 12, 1, 20, 2, 27),
  'link': 'news.google.com/./articles/CAIiEFYt9BkrzON5lyF3cHkseAsqGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_d7gU?hl=en-US&gl=US&ceid=US%3Aen',
  'img': 'https://lh3.googleusercontent.com/g3nUNRA_3mBrV6g866OaK6tmy-ageNNgBKmnW86A3RVcBjxjd4-V-jZ6AdMrB15IVmJI0-50H8cWAiGLxNc=-p-df-h100-w100',
  'media': None,
  'site': 'CNBC'},
 ....]

Context (Environment)

I'm using GoogleNews from a virtualenv, this is what I have tried:

  • GoogleNews==1.5.1 & Python 3.6 Working only with get_news() function
  • GoogleNews==1.5.1 & Python 3.8 Working only with get_news() function
  • Ubuntu version:
    Distributor ID: Ubuntu
    Description:    Ubuntu 18.04.5 LTS
    Release:        18.04
    Codename:       bionic

Detailed Description

I would like to understand where the issue is happening and fix it on Python 3.6 and 3.8. It would be great if we could find a solution for this issue, which happens only with the search() function.

API seems to go on and offline when hit with heavy load

Over the last few days I observed the following.

April 11th: working all day, only doing small non consecutive searches
April 12th:

  • working with small non consecutive searches
  • (8pm ish) worked, hit API with consecutive (30 threads) searches (~60 searches, ~3000 links)
  • (9pm ish) returns nothing for simple test case
  • (11pm ish) returns results for simple test
  • (11pm+ ish) worked, hit API with consecutive (30 threads) searches (~60 searches, ~3000 links)
  • (11pm++ ish) returns nothing for simple test case

April 13th:

  • (9pm) worked, hit API with non consecutive (1 thread, processing time after each search) searches (~60 searches, ~5000 links)
  • (right after large search ~11:40pm) returns results for simple test
  • (11:50pm) attempted to hit the API with the same searches as 9pm but much less processing time between searches. Finished 8 searches before returning nothing for remaining searches

April 14th:

  • (noon) working, simple test

Are there some limits or guidelines we should follow while this project is under development?
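
The timeline above suggests that bursts of back-to-back searches are what trigger the empty responses. A simple precaution is to enforce a minimum interval between consecutive searches. This is a minimal sketch: `run_throttled` is a hypothetical helper, and the two-second default is an arbitrary assumption, not a documented limit.

```python
import time


def run_throttled(tasks, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Run callables one after another, at least min_interval seconds apart."""
    results = []
    last_start = None
    for task in tasks:
        if last_start is not None:
            # Wait out whatever remains of the interval since the last start.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        results.append(task())
    return results
```

Each task would be a small function performing one search. Running this from a single thread, rather than 30, keeps the request rate predictable.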

duplicate result

I am getting a duplicate result on page 1

googlenews = GoogleNews()
googlenews.search('tailing dam')
googlenews.getpage(1)
news = googlenews.result()

Result: [image]

Could you please investigate the issue.
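
As a stopgap on the caller's side, duplicates can be dropped after the fact. This is a minimal sketch that treats two results as duplicates when they share the same 'link' value, which is an assumption about what "duplicate" means here. `dedupe_by_link` is a hypothetical helper.

```python
def dedupe_by_link(results):
    """Keep the first occurrence of each link, preserving the original order."""
    seen = set()
    unique = []
    for item in results:
        link = item.get("link")
        # Entries without a link cannot be compared, so they are always kept.
        if link is not None and link in seen:
            continue
        if link is not None:
            seen.add(link)
        unique.append(item)
    return unique
```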

Date field is always empty

from GoogleNews import GoogleNews


# Initialize the search
googlenews = GoogleNews()
googlenews.setlang('en')
googlenews.setperiod('d')
googlenews.search("tesla")

full_results= googlenews.result()

for result in full_results:
    print(result["date"])
    print(result["link"])

Local language

Hello, is it possible to search the news in my language (Thai)?
I've tried searching for news in Thai and it returned [ ].

news only in english

Whenever I try to get data, it downloads only data in English. Here is the code:

from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews = GoogleNews(lang = 'de')
googlenews = GoogleNews(start ='01/05/2019' ,end ='07/01/2019')
googlenews.search('Frankfurt')
result = googlenews.result()



for n in range(len(result)):
    print(n)
    for index in result[n]:
        print(index, '\n', result[n][index])

print(len(result))

and here is part of the output:

0
title 
 Penalty heartbreak for Eintracht Frankfurt as Chelsea book ...
media 
 Deutsche Welle
date 
 9 may 2019
desc 
 The German side came so close to the Europa League final, but finally fell to Chelsea on penalties. Europa League - FC Chelsea v Eintracht Frankfurt | ...
link 
 https://www.dw.com/en/penalty-heartbreak-for-eintracht-frankfurt-as-chelsea-book-english-europa-league-final/a-48678619
img 
 https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTpPc81Q_BdksRefghIs5VOt_w4cnPiye5fKYNd3KMkv-RywSnEgnsqncYLUXTychL9hH5N83s&s
1
title 
 Germany: Russian millionaire killed in Frankfurt plane crash
media 
 Deutsche Welle
date 
 31 mar 2019
desc 
 One of Russia's richest women, Natalia Fileva, has died in a plane crash near Frankfurt, Germany. The cause of the accident was not immediately clear.
link 
 https://www.dw.com/en/germany-russian-millionaire-killed-in-frankfurt-plane-crash/a-48138292
img 
 https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR1H3RncCLG4oX1aNgI9Mljc8Diti3_ILA_Nki-4TAJIF1qa1rWChdlDHQbHFqJKoIU0iRe5qY8&s

I am using Ubuntu 18.04.4 LTS. I tried to change my system locale to LANG=de_DE.UTF-8. Unfortunately, it didn't help... Thank you in advance.
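
One thing worth checking in the snippet above: each `googlenews = GoogleNews(...)` line creates a new object and replaces the previous one, so only the arguments of the last call apply. The sketch below illustrates this with `NewsClient`, a hypothetical stand-in class (not the real GoogleNews class). That the real constructor defaults to English and accepts `lang`, `start` and `end` together is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class NewsClient:
    """Hypothetical stand-in used only to illustrate object replacement."""
    lang: str = "en"   # assumed default language
    start: str = ""
    end: str = ""


# Three separate constructions: each assignment discards the previous object.
client = NewsClient()
client = NewsClient(lang="de")
client = NewsClient(start="01/05/2019", end="07/01/2019")
overwritten_lang = client.lang  # back to the default, the 'de' object is gone

# A single construction keeps every setting on the same object.
combined = NewsClient(lang="de", start="01/05/2019", end="07/01/2019")
print(overwritten_lang, combined.lang)
```

If the real class behaves the same way, passing the language and the dates in one call would be the first thing to try.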

Return NoneType

from GoogleNews import GoogleNews
googlenews = GoogleNews()

googlenews = GoogleNews(start='17/02/2021',end='22/02/2021')

news_keywords = ["Apple", "Microsoft"]  # etc. (list shortened in the report)

check = ['17/02/2021','18/02/2021','19/02/2021','20/02/2021','21/02/2021','22/02/2021']
cols = ['Date','Title','Description','provider','company']
lst = []
for i in news_keywords:
    print(i)
    googlenews.search(i)
    googlenews.get_page(1)
    googlenews.get_page(2)
    # etc. (rest of the loop omitted in the report)

This currently returns:

Apple
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable

for multiple companies over this time period.
I have run the same code for each week over the past months, so I'm unsure whether I'm doing something wrong with the latest update.
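
The dates above are written day first ('17/02/2021'), while an earlier example on this page passes start='02/01/2020' and end='02/28/2020', which can only be month first. Whether the format causes this particular error is not confirmed, but converting dates to MM/DD/YYYY rules it out. `to_mdy` is a hypothetical helper.

```python
from datetime import date, datetime


def to_mdy(value, input_format="%d/%m/%Y"):
    """Return value as 'MM/DD/YYYY'.

    Strings are parsed with input_format first; date and datetime objects
    are formatted directly.
    """
    if isinstance(value, str):
        value = datetime.strptime(value, input_format)
    return value.strftime("%m/%d/%Y")


print(to_mdy("17/02/2021"))
print(to_mdy(date(2021, 2, 22)))
```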

Is the API still functioning

I used the API extensively for a few months. I came back to it today after a few weeks and it seems to not be returning any results. I could be blocked by some server along the way. Is the API still working for others?

Code used for test.

from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('tesla')
if googlenews.result() == []:
    print("failed")
else:
    print("passed")
    print(googlenews.get__links())

De-amplify links

Add option to get direct links to articles instead of Google AMP links.

Date of article is not fetched properly

Instead of getting an exact date I get '1 month ago' in the results document. How can I fix that? Thank you for your help.

from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk

nltk.download('punkt')

# The config sets a browser-like user agent so that newspaper can access URLs
# that would otherwise answer with a 403 client error when the article is
# downloaded.
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 5

googlenews = GoogleNews(start='10/19/2020', end='10/19/2020')
googlenews.search('test')
result = googlenews.result()
df = pd.DataFrame(result)
print(df.head())
for i in range(2, 5):
    googlenews.getpage(i)
    result = googlenews.result()
    df = pd.DataFrame(result)

# Collect one row per successfully downloaded article.
rows = []
for ind in df.index:
    row = {}
    article = Article(df['link'][ind], config=config)
    try:
        article.download()
        article.parse()
    except Exception:
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    article.nlp()

    row['Date'] = df['date'][ind]
    row['Media'] = df['media'][ind]
    row['Title'] = article.title
    row['Article'] = article.text
    row['Summary'] = article.summary
    rows.append(row)
news_df = pd.DataFrame(rows)
news_df.to_excel("articles.xlsx")

Documentation requested

Hi,
Do I need to set a Google API key?
I tried the package and it only returns null.
Let me know what other details you might need.

numpy (np) import missing

Hi,

I believe you're missing an import statement -> in your 'define_date' function you refer to np, but it isn't imported as far as I can see.

see -> return float(np.nan)

Best regards,

Mike
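
For reference, a NaN float can be produced without NumPy at all, which would avoid the missing import. This is a minimal sketch. `nan_placeholder` is a hypothetical name and is not the body of the package's 'define_date' function.

```python
import math


def nan_placeholder():
    """Return a float NaN using only the standard library."""
    return float("nan")  # math.nan is an equivalent constant


value = nan_placeholder()
print(math.isnan(value))
```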

Shortened titles

I encountered an issue with titles.

{'title': 'Crypto miners halt China business after Beijing cracks down ...',
'media': 'The Straits Times', 'date': '4 mins ago', 'datetime': None,
'desc': 'TOP, are suspending their China operations after Beijing stepped up its efforts to crack down on bitcoin mining and trading, sending the digital currency tumbling ...',
'link': 'https://www.straitstimes.com/business/companies-markets/bitcoin-volatility-puts-weekend-traders-on-stomach-churning-ride',
'img': 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=='}

As can be seen in the output above, the title is shortened. What is the reason, and is there a way to opt out of the shortening?

Search <start,end> functionality not working

[Screenshot 2020-12-10 at 13 24 01]

The start date seems to default to 7 days ago. It seems to work for dates before the start of December. Any help getting data from the 1st to the 3rd would be appreciated.

Note: if I search from the 1st to the 10th of December, it still only returns the 3rd to the 10th.

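GoogleNews update problem

After the update, the news scraped for a keyword search no longer matches Google News, and the timestamps jump around. Normally it should fetch the news for the keyword query, but since the update I can't tell what it is fetching...
This is what the old version fetched earlier:
[image: googlenews]
Later the old version's code stopped returning media and date:
[image: capture]
So I updated to see whether media and date would come back. They did, but the news no longer matches:
[image: error]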

How to get the full news article?

Is there a way to get the full news text in the description section, or at least a paragraph instead of half sentences ending with '...'?

Any success with fetching images?

Dear Author @HurinHu,

Thanks for the package!
Fetching images is still a pain point in the code. The images are stored as a JavaScript object in the self.content variable (shown in the image). I tried extracting the value of the variable but didn't succeed. Could you try?

[image]

... see pull request

Hello,
you can see in my fork the algorithm I used for retrieving the datetime from the news. It is based on this package. I have also added descending ordering of the collected news.

A new test script shows how the algorithm handles rubbish from the HTML (usually unparsable words ahead of the date).

[Image 2020-11-11 200123]

P.S. It can be improved further by removing duplicate copies of the same articles.

Multiple search terms

Hi,

I'm trying to search for multiple terms using GoogleNews. My script used to work with older versions of GoogleNews, but it no longer does. It only searches for the first term multiple times and gives me repeated results.

Any help is greatly appreciated. I need this fixed by Wednesday for work.

from GoogleNews import GoogleNews
import sys

f = open("googlenews22.txt", "w")
keywordlist = ['apple', 'samsung', 'nokia']
googlenews = GoogleNews()
for word in keywordlist:
    googlenews.search(word)
    googlenews.setTimeRange('15/05/2020','15/06/2020')
    googlenews.getpage(1)
    results = googlenews.result()
    listofres = []
    # Iterate over the result dicts (the posted code looped over the
    # characters of `word`, which cannot be indexed by 'title').
    for ting in results:
        title = ting['title']
        date = ting['date']
        link = ting['link']
        listofres += [[title, date, link]]
        f.write("%s, %s, %s \n" % (title, date, link))
f.close()

[Error] Missing/Empty data entries when deploying to railway.app

Hello author. I want to report this strange bug.

I tested locally with your library and it still works normally. The "entries" part is still present:
https://pastebin.com/xYs8pmkv

..........
'entries': [
    {
      'title': 'Angela Merkel tiêm vaccine: Liều một AstraZeneca, liều hai Moderna - BBC Tiếng Việt',
      'title_detail': {
        'type': 'text/plain',
        'language': None,
        'base': '',
        'value': 'Angela Merkel tiêm vaccine: Liều một AstraZeneca, liều hai Moderna - BBC Tiếng Việt'
      },
......

But when I deploy to https://railway.app/, the result somehow contains an empty "entries" list.
The log: https://pastebin.com/Sj5ARuVN

....
 'entries': [
    // empty
  ]

The code I'm using:

gn = GoogleNews(lang = 'vi')
search = gn.search('Covid-19', when = '1h')
print(search)
entries = search['entries']
print(entries)

'NoneType' object has no attribute 'get'

The search does not return news data that is available on Google:
googlenews.clear()

1. googlenews.search('vijay kumbhar')
2. 'NoneType' object has no attribute 'get'
3. googlenews.result()
4. []
[image: newsdata]
