anaskhan96 / soup
Web Scraper in Go, similar to BeautifulSoup
License: MIT License
Are editing (adding, updating, or deleting tags and attributes) and export (to string) on the roadmap?
From the source code I see that FindNextSibling calls r.Pointer.NextSibling.NextSibling, which wrongly assumes that NextSibling has another NextSibling, and crashes when it does not.
e.g.
const html = `<html>
<head>
<title>DOM Tutorial</title>
</head>
<body>
<a>DOM Lesson one</a><p>Hello world!</p>
</body>
</html>`
func main() {
doc := soup.HTMLParse(html)
link := doc.Find("a")
next := link.FindNextSibling()
fmt.Println(next.Text())
}
// $ panic: runtime error: invalid memory address or nil pointer dereference
This also applies to FindPrevSibling.
BTW, I suggest there should be both FindNextSibling and FindNextSiblingElement, as the spec describes. (This might be another issue; I guess what you want to implement is FindNextSiblingElement.)
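A minimal sketch of the nil-safe distinction suggested above. To stay runnable with the standard library only, it uses a hypothetical stand-in `node` type instead of the real `html.Node`, and the names `nextSibling` and `nextElementSibling` are illustrative assumptions, not soup's API.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for html.Node, reduced to the fields needed here.
type node struct {
	Data        string
	IsElement   bool
	NextSibling *node
}

var errNoSibling = errors.New("no next sibling found")

// nextSibling returns the immediate next sibling (text or element), or an error.
func nextSibling(n *node) (*node, error) {
	if n == nil || n.NextSibling == nil {
		return nil, errNoSibling
	}
	return n.NextSibling, nil
}

// nextElementSibling skips non-element nodes and stops safely at the end of the list.
func nextElementSibling(n *node) (*node, error) {
	if n == nil {
		return nil, errNoSibling
	}
	for s := n.NextSibling; s != nil; s = s.NextSibling {
		if s.IsElement {
			return s, nil
		}
	}
	return nil, errNoSibling
}

func main() {
	// Mirrors <a>DOM Lesson one</a><p>Hello world!</p> followed by a trailing text node.
	tail := &node{Data: "\n"}
	p := &node{Data: "p", IsElement: true, NextSibling: tail}
	a := &node{Data: "a", IsElement: true, NextSibling: p}

	if s, err := nextElementSibling(a); err == nil {
		fmt.Println(s.Data)
	}
	if _, err := nextElementSibling(p); err != nil {
		fmt.Println(err)
	}
}
```

Returning an error instead of dereferencing a second NextSibling lets the caller decide what a missing sibling means.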
I'm curious if there's a way to check if an element exists and have a Boolean returned if it does or does not exist rather than having the console just output something like
2017/06/06 11:21:52 Error occurred in Find() : Element `div` with attributes `class title` not found
Right now I suppose you would do this by checking if error was non-nil and then check the error to see if it contained "not found", which you would only know about if you read the source code of this project 😄
I think what I am proposing is to add something that does that check for you in the library. Maybe something like:
func (r Root) Empty() bool {
	if r.Error == nil {
		return false
	}
	// An error value cannot be converted with string(); use its Error() method.
	return strings.Contains(r.Error.Error(), "not found")
}
Is this something other people would see as valuable? I would use it sorta like this:
main := doc.Find("section", "class", "gramb")
if main.Empty() {
return errors.New("No results for this query")
}
defs := main.FindAll("span", "class", "ind")
// Other processing here
Right now I'm just checking if main.Error is non nil and returning no results. Would just be nice (I think) to have a cleaner interface around it.
If you think this is worth doing I'd love to take a crack at it!
Thanks for this library, it's immensely helpful to my side project 😄
Hi,
Is it possible for you to replace log.Fatal instances with something else that returns an error instead?
It feels unfair if the entire program shuts down because soup couldn't find an element, and so on. I would rather handle the error myself when it cannot find something, cannot parse the HTML, etc.
Thank you, been using this. :)
rsp, err := soup.Get(pSp.pageQueue[pSp.pageIndex])
if err != nil { log.Printf("get page : %s, err : %s", pSp.pageQueue[pSp.pageIndex], err) }
doc := soup.HTMLParse(rsp)
pageExist := doc.Find("div", "class", "page")
Take code like the above. Root is defined as:
type Root struct {
Pointer *html.Node
NodeValue string
Error error
}
The type of pageExist is Root, and its Pointer may be nil, so chaining calls like this is not recommended:
doc.Find("div", "class", "tags").FindAll("span", "class", "tag-item")
It may sometimes cause a panic. It is better to write:
pageExist := doc.Find("div", "class", "page")
if pageExist.Pointer == nil {
	return // or do something else
}
aLinks := pageExist.FindAll("a")
Hello,
How can I do the equivalent of the following Python BeautifulSoup lookup with soup? Thank you for your suggestions!
Find("div", "class", "toggleVisible")["id"]
Hi, I'm just starting out with Go, so this question might be dumb.
Is there a way to do a FindAll with a regular expression using this library?
If it is not implemented, will it be? Or am I looking at the wrong package?
thanks
If the element is not found in the first child node, the result is returned immediately, so the loop never reaches the remaining children.
I think the condition below should be if q {
for c := n.FirstChild; c != nil; c = c.NextSibling {
p, q := findOnce(c, args, true, strict)
if !q {
return p, q
}
}
Line 504 in cb47551
I want to get the content of an element that contains other elements, like this:
html:
<div id="view">
hello
<p>hello</p>
</div>
go:
doc := soup.HTMLParse(html)
text := doc.Find("div", "id", "view").Text()
fmt.Println(text)
In this sample, it only outputs "hello". I want it to output "hello<p>hello</p>". How can I do that?
Thanks for having a look.
Instead of using defer fetch.CatchPanic("Find()"), the functions should return errors, especially when no data has been found.
I'm running into an odd issue while using soup to parse Fmylife's site for FMLs: it happens when I get an FML that contains the (&)nbsp; entity
<p class="block">
<a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
<span class="icon-piment"></span>
[Insert FML text here] FML
</a>
</p>
when I try to call the text, it returns blank text and nothing else.
I usually call it using .Find("p", "class", "block").Find("a").Text() and if it doesn't have the whitespace tag, it returns fine.
In python we can use:
soup.findAll(attrs={'class': None})
Hello. Please help me.
Code:
package main
import (
"bufio"
"fmt"
"log"
"os"
"strings"
"github.com/anaskhan96/soup"
)
func main() {
fmt.Printf("Enter the name of the city : ")
city, _ := bufio.NewReader(os.Stdin).ReadString('\n')
city = city[:len(city)-1]
cityInURL := strings.Join(strings.Split(city, " "), "+")
url := "https://www.bing.com/search?q=weather+" + cityInURL
resp, err := soup.Get(url)
if err != nil {
log.Fatal(err)
}
doc := soup.HTMLParse(resp)
fmt.Println(doc)
grid := doc.Find("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
fmt.Println("grid = ", grid)
heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
GOROOT=C:\Go #gosetup
GOPATH=C:\Users\User\go\src\soup;C:\Users\User\go #gosetup
C:\Go\bin\go.exe build -o C:\Users\User\AppData\Local\Temp___go_build_weather_go.exe C:\Users\User\go\src\soup\weather.go #gosetup
C:\Users\User\AppData\Local\Temp___go_build_weather_go.exe #gosetup
Enter the name of the city : moscow
{0xc0002600e0 html }
grid = { element div
with attributes class b_antiTopBleed b_antiSideBleed b_antiBottomBleed
not found}
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x6628dc]
goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x0, 0xc000471e70, 0x3, 0x3, 0xc000190000, 0xc000471b58, 0x468093)
C:/Users/User/go/src/soup/src/github.com/anaskhan96/soup/soup.go:392 +0x31c
github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x760a00, 0xc000153820, 0xc000471e70, 0x3, 0x3, 0x0, 0x0, ...)
C:/Users/User/go/src/soup/src/github.com/anaskhan96/soup/soup.go:167 +0x94
main.main()
C:/Users/User/go/src/soup/weather.go:27 +0x5ef
Process finished with exit code 2
I want to get the status code of the HTTP request so I can make sure the website responded with 200 OK before continuing.
Sometimes the website returns 404.
Is there any way to get the HTTP status code?
Just curious if soup has anything similar to how BeautifulSoup lets you parse HTML comments in Python?
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
Trying to parse some HTML where some data is commented, and able to do the following in Python:
from bs4 import BeautifulSoup, Comment
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
comments_soup = BeautifulSoup(comment, 'lxml')
Is there anything close to that here? Or any chance of adding something like it?
I need to access a specific website through a local proxy port.
thanks
Currently, Find("a", "class", "message") only matches <a class="message"></a> and does not match <a class="message input-message"></a>, even though both elements have the class message.
Could this be added?
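The matching rule being asked for can be sketched as a token comparison on the class attribute. `hasClass` is a hypothetical helper for illustration, not part of soup.

```go
package main

import (
	"fmt"
	"strings"
)

// hasClass reports whether the space-separated class attribute
// contains want as a whole token.
func hasClass(classAttr, want string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasClass("message", "message"))               // true
	fmt.Println(hasClass("message input-message", "message")) // true
	fmt.Println(hasClass("input-message", "message"))         // false
}
```

Comparing whole tokens rather than substrings keeps input-message from matching a search for message.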
package main
import (
"fmt"
"github.com/anaskhan96/soup"
"os"
)
func main() {
resp, err := soup.Get("https://slashdot.com")
if err != nil {
os.Exit(1)
}
doc := soup.HTMLParse(resp)
spans := doc.FindAll("span")
for _, span := range spans {
fmt.Println(span.Text())
}
}
Result:
$ ./test-span
Slashdot
Stories
Polls
Deals
Login
Sign up
RSS
Facebook
Google+
Twitter
Newsletter
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x61aad7]
goroutine 1 [running]:
github.com/anaskhan96/soup.Root.Text(0xc420439030, 0x6bbff4, 0x4, 0x0, 0x0, 0x0, 0xa)
go/src/github.com/anaskhan96/soup/soup.go:257 +0xa7
main.main()
test-span.go:17 +0x1dd
nodeLinks is a global variable in find.go that is re-initialised to a new slice of capacity 10 whenever FindAll() in soup.go is called. This causes problems when FindAll() is called concurrently in the driver program, because nodeLinks keeps getting re-initialised before either call has finished collecting its nodes.
Suppose we have HTML like this:
<body>
<div class="container">
<div this-attr="don't care about it's value at all">
</div>
</body>
</html>
And suppose we search like this:
doc.Find("div", "this-attr")
It yields an error (which I think is expected).
The function findOnce accesses the attribute value argument as well 🤔
if uni == true {
if n.Type == html.ElementNode && matchElementName(n, args[0]) {
if len(args) > 1 && len(args) < 4 {
for i := 0; i < len(n.Attr); i++ {
attr := n.Attr[i]
searchAttrName := args[1]
searchAttrVal := args[2]
if (strict && attributeAndValueEquals(attr, searchAttrName, searchAttrVal)) ||
(!strict && attributeContainsValue(attr, searchAttrName, searchAttrVal)) {
return n, true
}
}
} else if len(args) == 1 {
return n, true
}
}
}
uni = true
for c := n.FirstChild; c != nil; c = c.NextSibling {
p, q := findOnce(c, args, true, strict)
if q != false {
return p, q
}
}
return nil, false
}
So my question is whether searching by attribute name alone is possible. Thank you!
Is there any way to get the equivalent of .HTML(), but excluding the element's own markup (just like JS's .innerHTML), without having to resort to regex?
An example:
element.HTML()
yields <p><a href="square-cover-art.jpeg">My <em>wacky</em> label with <strong>bold</strong> and <code>code</code> and stuff “hmmm”</a></p>
I want to get <a href="square-cover-art.jpeg">My <em>wacky</em> label with <strong>bold</strong> and <code>code</code> and stuff “hmmm”</a>
I guess I could iterate over element.Children() and concatenate each child's .HTML(), but I think having a .InnerHTML() would make things nicer (and a tad better when it comes to performance, I guess).
I'm willing to make a PR :)
For example:
<div align="center">
<a href="search_3.asp?action=up">up</a>
<a href="search_3.asp?action=down">down</a>
(2021-9-20~2021-9-26)
</div>
Currently, div.Text() does not return (2021-9-20~2021-9-26). Wouldn't it be better if it did?
For proper selection, it would be awesome to be able to navigate to the current element's parent, then keep going through its siblings, and so on.
Right now it is quite hard to properly find what I am looking for from a strict top-down view.
Hi!
If html is not in UTF-8, e.g.:
<meta http-equiv="content-type" content="text/html; charset=windows-1251">
please convert it to UTF-8.
Thanks
Hello! I'm a Korean university student. If you don't mind, I would like to contribute to soup. I'm not a professional programmer, so I can't contribute difficult code...
But I can contribute more examples of soup to your repository!
Or I can translate some of the guidelines.
If that is okay with you, I can work on it for a month and then send you a pull request.
Is it okay for me to write examples for soup, or to translate the guidelines or other parts of your repository?
Please reply to me!! :) I will work very hard!!
I'm not very good at English, so I'm sorry if this is hard to read.
Sincerely, lionking6792
It seems go mod/get is unable to understand shorter version strings
I forked your repo and changed it to 1.2.1 and go get did work again.
Currently I'm only able to get v1.1.1
Could you release a new version with 3 digits?
Libraries should not catch a panic() and then print to stdout.
If I have the following code
th := row.Find("th")
if th.Error == nil && th.Text() == "Service Expiry Date" {
...
}
and the call to row.Find("th") returns a structure whose Pointer has a nil FirstChild, then soup will panic, catch the panic, and print to stdout.
By catching the panic, it makes it super hard to figure out what is causing the panic because all I got when running my program was this:
2017/12/18 11:17:56 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference
However, if I comment out the defer catchPanic("Text()") call, then I get a much more helpful error:
DEBUG: DEBUG: &{0xc0422408c0 <nil> <nil> 0xc042240930 0xc042240a10 3 th th [{ align right} { width 200}]}
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x28 pc=0x688cdb]
goroutine 1 [running]:
github.com/anaskhan96/soup.Root.Text(0xc0422409a0, 0x7a7f94, 0x2, 0x0, 0x0, 0xc0420123f0, 0x63)
C:/Users/chrome/Development/go/src/github.com/anaskhan96/soup/soup.go:219 +0xbb
main.fetchDate(0xc042052400, 0x1f)
C:/Users/chrome/Development/go/src/gitlab.corp.xxx.com/se/shcheck/cmd/shcheck/main.go:96 +0x8e6
main.main()
C:/Users/chrome/Development/go/src/gitlab.corp.xxx.com/se/shcheck/cmd/shcheck/main.go:112 +0x23d
Then, if I look at soup.go line 219, I see
k := r.Pointer.FirstChild
checkNode:
if k.Type != html.TextNode {
and it is now clear to me that r.Pointer.FirstChild is nil, and I need to check that it is not nil before calling Text().
However, in my opinion, you should be checking that value in your library and returning string, error.
When doing a FindAll(“td”), then calling Text() on a result, a null pointer error is thrown whenever an empty/nil value is encountered in the slice.
“runtime error: invalid memory address or nil pointer dereference
errorString”
An error object should be returned instead, or an empty string.
I have a case where I want to find an element that is either a div or a p, and I don't know how to do it. It probably isn't possible with the existing library, and we would need something like FindOr.
Great project, but only supports HTTP Get? Would be awesome to have Post too.
I tried to run the test weather.go on my machine and got this.
Enter the name of the city : Brisbane
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x665715]
goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x0, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x70207e, 0x13)
/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:345 +0x315
github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x75c820, 0xc000364040, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x0, ...)
/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:121 +0x82
main.main()
/home/stevek/tmp/go-lang/src/weather.go:24 +0x49d
exit status 2
Hi, I tried your weather example and it always throws an "invalid memory address" error. I tried to reproduce the same bug with another website, and it can only find classes that have no spaces in them. I don't know why, but your parser seems to have stopped handling spaces.
I added a fmt.Println() call to print the result of the one class search that contains spaces (grid). This is the code:
package main
import (
"bufio"
"fmt"
"log"
"os"
"strings"
"github.com/anaskhan96/soup"
)
func main() {
fmt.Printf("Enter the name of the city : ")
city, _ := bufio.NewReader(os.Stdin).ReadString('\n')
city = city[:len(city)-1]
cityInURL := strings.Join(strings.Split(city, " "), "+")
url := "https://www.bing.com/search?q=weather+" + cityInURL
resp, err := soup.Get(url)
if err != nil {
log.Fatal(err)
}
doc := soup.HTMLParse(resp)
grid := doc.Find("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
fmt.Println("Print grid:", grid)
heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
conditions := grid.Find("div", "class", "wtr_condition")
primaryCondition := conditions.Find("div")
secondaryCondition := primaryCondition.FindNextElementSibling()
temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
caption := secondaryCondition.Find("div").Text()
fmt.Println("City Name : " + heading)
fmt.Println("Temperature : " + temp + "˚C")
for _, i := range others {
fmt.Println(i.Text())
}
fmt.Println(caption)
}
And that's the output:
Enter the name of the city : New York
Print grid: {<nil> element `div` with attributes `class b_antiTopBleed b_antiSideBleed b_antiBottomBleed` not found}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x61d1f5]
goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x0, 0xc42005be68, 0x3, 0x3, 0xc420050000, 0x4aa247, 0xc420261e00)
/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:304 +0x315
github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x6e1e60, 0xc420242070, 0xc42005be68, 0x3, 0x3, 0x0, 0x0, ...)
/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:120 +0x8d
main.main()
/home/fef0/Code/Go/Test/Test.go:26 +0x4e3
exit status 2
As the second line shows, the grid could not be found, but in fact that happens only because there are spaces in the class name.
I hope you can fix that as soon as possible. Bye for now!
Example: get the text in p's NodeValue
<p>
<a class="btn" href=""> </a>
text
</p>
Is there any way of getting a's parent p as a Root? I don't see the parent being recorded in the DFS.
Loving this library so far :-)
It would be really useful to be able to define our own headers, like user-agent for example.
Then I'd be able to use this for sites that require auth :-)
https://github.com/anaskhan96/soup/blob/v1.2.4/soup.go#L232
// PostForm is a convenience method for POST requests that
package main
import (
"fmt"
"log"
"net/http"
"time"
"github.com/anaskhan96/soup"
)
func main() {
go func() {
http.ListenAndServe(":12345", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
fmt.Fprint(w, "OK")
}))
}()
time.Sleep(time.Second)
resp, err := soup.Get("http://127.0.0.1:12345/")
if err != nil {
log.Println("Error:", err.Error())
return
}
doc := soup.HTMLParse(resp)
r := doc.Find("Semething").Find("SomethingElse")
fmt.Println(r.Error)
}
Hello, if I try to chain the Find and FindAll methods on non-existent tags, as in the example above, I get a panic:
$ go run .
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x66ce1b]
goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x6b64c0?, {0xc00011fe50?, 0x1, 0x1}, 0x2?, 0x0)
/home/alex/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:502 +0xfb
github.com/anaskhan96/soup.Root.Find({0x0, {0x0, 0x0}, {0x766ee0, 0xc000238030}}, {0xc00011fe50?, 0x1, 0x1})
/home/alex/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:268 +0xa5
main.main()
/home/alex/test/play3/main.go:24 +0x1ca
exit status 2
I believe that both func findOnce and func findAllofem should check whether n *html.Node is nil before proceeding with the processing.
Am I understanding this correctly?
Thanks,
Alex
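The nil check proposed above can be sketched on a simplified tree. The `node` type here is a hypothetical stand-in for html.Node, and this `findOnce` only borrows the name for illustration; it matches on the tag alone and is not the library's code.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node.
type node struct {
	Tag         string
	FirstChild  *node
	NextSibling *node
}

// findOnce does a depth-first search for tag and tolerates a nil start node.
func findOnce(n *node, tag string) (*node, bool) {
	if n == nil {
		return nil, false
	}
	if n.Tag == tag {
		return n, true
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		// Return only on success; otherwise keep scanning the remaining siblings.
		if p, ok := findOnce(c, tag); ok {
			return p, true
		}
	}
	return nil, false
}

func main() {
	span := &node{Tag: "span"}
	div := &node{Tag: "div", NextSibling: span}
	body := &node{Tag: "body", FirstChild: div}

	first, ok := findOnce(body, "missing")
	fmt.Println(first == nil, ok)

	// Chaining from a failed lookup no longer dereferences nil.
	_, ok = findOnce(first, "span")
	fmt.Println(ok)

	_, ok = findOnce(body, "span")
	fmt.Println(ok)
}
```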
Hi!
I have HTML files that do not use ids. With BeautifulSoup it is easy to find such an element using find("Some text"):
<span style="color: #012345">Some text</span>
Is the only way to find this to use FindAll("span") and then iterate through all the spans found? In that case, how can I check whether a particular span element contains text? I wouldn't like to disable debugging since, I guess, an empty span is not necessarily a critical error.
bs allows you to use select for CSS selectors. Is there any such thing in this library?
fatal error: concurrent map iteration and map write
goroutine 833 [running]:
runtime.throw(0x7161cb, 0x26)
/root/.gvm/gos/go1.15.5/src/runtime/panic.go:1116 +0x72 fp=0xc000069938 sp=0xc000069908 pc=0x437312
runtime.mapiternext(0xc000069a10)
/root/.gvm/gos/go1.15.5/src/runtime/map.go:853 +0x554 fp=0xc0000699b8 sp=0xc000069938 pc=0x412574
github.com/anaskhan96/soup.setHeadersAndCookies(0xc0002e4600)
/root/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:145 +0x87 fp=0xc000069b28 sp=0xc0000699b8 pc=0x6831a7
github.com/anaskhan96/soup.GetWithClient(0xc000288660, 0x24, 0xc0004b3da0, 0x0, 0x0, 0x0, 0x0)
/root/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:117 +0x18b fp=0xc000069be0 sp=0xc000069b28 pc=0x682cab
This method was previously working but for some reason, it returns nil every single time now
//example
t, _ := soup.Get("https://google.com")
fmt.Println(soup.HTMLParse(t)) //prints {address <nil>}
Thanks for making an awesome package.
Go Modules can't fetch the latest version of this package because of the version tag. It can only get 1.0.1, not 1.1.
Can you add a new tag named 1.1.0?
I need a proxy to connect to the website.