anaskhan96 / soup

2.2K 37.0 168.0 102 KB

Web Scraper in Go, similar to BeautifulSoup

License: MIT License

Go 100.00%

golang go webscraper webscraping beautifulsoup web-scraper html-node

soup's Issues

Add String() function

Are editing (adding, updating, or deleting tags and attributes) and exporting to a string on the roadmap?

FindNextSibling bug

From the source code I see that FindNextSibling calls r.Pointer.NextSibling.NextSibling, which wrongly assumes that NextSibling always has a NextSibling of its own, and crashes when it does not.

e.g.

package main

import (
	"fmt"

	"github.com/anaskhan96/soup"
)

const html = `<html>
  <head>
      <title>DOM Tutorial</title>
  </head>
  <body>
      <a>DOM Lesson one</a><p>Hello world!</p>
  </body>
</html>`

func main() {
	doc := soup.HTMLParse(html)
	link := doc.Find("a")
	// FindNextSibling dereferences NextSibling.NextSibling without a nil check.
	next := link.FindNextSibling()
	fmt.Println(next.Text())
}

// $ panic: runtime error: invalid memory address or nil pointer dereference

This also applies to FindPrevSibling.

BTW, I suggest there should be FindNextSibling and FindNextSiblingElement as the spec describes. (This might be another issue, I guess what you want to implement is FindNextSiblingElement.)
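
The crash described in this issue comes from following a sibling pointer without checking it first. As a minimal sketch of a nil-safe "next element sibling" walk, the code below uses a stand-in `node` type (hypothetical, mirroring only the `Type`, `Data` and `NextSibling` fields of `html.Node`) rather than soup's real internals; `nextElementSibling` and `errNoSibling` are illustrative names, not part of the library.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeType and node are hypothetical stand-ins for html.NodeType and html.Node.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

type node struct {
	Type        nodeType
	Data        string
	NextSibling *node
}

// errNoSibling is an illustrative sentinel, not part of soup.
var errNoSibling = errors.New("no next element sibling")

// nextElementSibling skips non-element nodes and returns an error, instead of
// panicking, when the sibling chain ends.
func nextElementSibling(n *node) (*node, error) {
	if n == nil {
		return nil, errNoSibling
	}
	for s := n.NextSibling; s != nil; s = s.NextSibling {
		if s.Type == elementNode {
			return s, nil
		}
	}
	return nil, errNoSibling
}

func main() {
	// Models <a>DOM Lesson one</a><p>Hello world!</p>.
	p := &node{Type: elementNode, Data: "p"}
	a := &node{Type: elementNode, Data: "a", NextSibling: p}

	if next, err := nextElementSibling(a); err == nil {
		fmt.Println("next element:", next.Data)
	}
	if _, err := nextElementSibling(p); err != nil {
		fmt.Println("error:", err)
	}
}
```

Returning an error keeps the decision about a missing sibling with the caller, which is also what several other issues in this list ask for.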

Check if element exists without triggering warnings in console?

I'm curious if there's a way to check if an element exists and have a Boolean returned if it does or does not exist rather than having the console just output something like

2017/06/06 11:21:52 Error occurred in Find() : Element `div` with attributes `class title` not found

Proposal: Add an "Empty" func to Root that would make it easier to tell when a query didn't return results

Right now I suppose you would do this by checking if error was non-nil and then check the error to see if it contained "not found", which you would only know about if you read the source code of this project 😄

I think what I am proposing is to add something that does that check for you in the library. Maybe something like:

func (r Root) Empty() bool {
    if r.Error == nil {
        return false
    }
    // An error value cannot be converted with string(); use its Error() method.
    return strings.Contains(r.Error.Error(), "not found")
}

Is this something other people would see as valuable? I would use it sorta like this:

main := doc.Find("section", "class", "gramb")
if main.Empty() {
  return errors.New("No results for this query")
}
defs := main.FindAll("span", "class", "ind")
// Other processing here

Right now I'm just checking if main.Error is non nil and returning no results. Would just be nice (I think) to have a cleaner interface around it.

If you think this is worth doing I'd love to take a crack at it!

Thanks for this library, it's immensely helpful to my side project 😄
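
One way to avoid matching on the error text, as the proposal above does, is a sentinel error checked with `errors.Is`. The sketch below is only an illustration: `ErrNotFound`, `result` and `find` are hypothetical names standing in for soup's `Root` and `Find`, and nothing here is claimed to be the library's API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel error; it is not part of soup's API.
var ErrNotFound = errors.New("element not found")

// result is a stand-in for soup's Root, keeping only the Error field.
type result struct {
	Error error
}

// Empty reports whether the query failed because nothing matched.
func (r result) Empty() bool {
	return errors.Is(r.Error, ErrNotFound)
}

// find simulates a lookup and wraps the sentinel with details about the query.
func find(known map[string]bool, tag string) result {
	if !known[tag] {
		return result{Error: fmt.Errorf("find %q: %w", tag, ErrNotFound)}
	}
	return result{}
}

func main() {
	known := map[string]bool{"section": true}
	fmt.Println("section empty:", find(known, "section").Empty())
	fmt.Println("span empty:", find(known, "span").Empty())
}
```

Because the sentinel is wrapped with `%w`, the message can still carry the tag and attributes while callers test for the condition rather than the wording.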

log.Fatal

Hi,

Is it possible for you to replace log.Fatal instances with something else that returns an error instead?

It feels unfair for the entire program to shut down because soup couldn't find an element, and so on. I would rather handle the error myself when it cannot find something, cannot parse the HTML, etc.

Thank you, been using this. :)

Find() may return a Root whose Pointer field is nil

Take code like this:

rsp, err := soup.Get(pSp.pageQueue[pSp.pageIndex])
if err != nil {
	log.Printf("get page : %s, err : %s", pSp.pageQueue[pSp.pageIndex], err)
}
doc := soup.HTMLParse(rsp)
pageExist := doc.Find("div", "class", "page")

Here pageExist is of type Root:

type Root struct {
	Pointer   *html.Node
	NodeValue string
	Error     error
}

Pointer may be nil, so chaining calls like this is not recommended, because it can sometimes panic:

doc.Find("div", "class", "tags").FindAll("span", "class", "tag-item")

It is better to check the result first:

pageExist := doc.Find("div", "class", "page")
if pageExist.Pointer == nil {
	return // or do something else
}
aLinks := pageExist.FindAll("a")

How can I find with class and get attribute?

Hello,

How can I do the equivalent of the following Python BeautifulSoup lookup with soup? Thank you for your suggestions!

Find("div", "class", "toggleVisible")["id"]

FindAll Regex

Hi, I'm just starting out with Go, so this question might be dumb.

Is there a way to FindAll by regular expression with this library?

If it is not implemented, will it be? Or am I looking at the wrong package?

Thanks

findOnce breaks after the first child node

If the element is not found in the first child node, the value is returned, and the loop has no effect.

I think this should be if q {

for c := n.FirstChild; c != nil; c = c.NextSibling {
	p, q := findOnce(c, args, true, strict)
	if !q {
		return p, q
	}
}

soup/soup.go

Line 504 in cb47551

if q != false {
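
The traversal this issue discusses can be sketched independently of soup's source. Below, a depth-first search over a stand-in `node` type (hypothetical, with only the fields needed here) returns from the child loop only when a match was found, so a miss in the first child does not stop the scan of its siblings. `findFirst` is an illustrative name, and the nil guard is an addition of this sketch.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node with only the fields used here.
type node struct {
	Data        string
	FirstChild  *node
	NextSibling *node
}

// findFirst does a depth-first search and returns the first node named tag.
func findFirst(n *node, tag string) (*node, bool) {
	if n == nil {
		return nil, false
	}
	if n.Data == tag {
		return n, true
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		// Return only on success; on a miss, keep scanning the remaining siblings.
		if found, ok := findFirst(c, tag); ok {
			return found, true
		}
	}
	return nil, false
}

func main() {
	// Models <body><a></a><p></p></body>: the match sits in the second child.
	p := &node{Data: "p"}
	a := &node{Data: "a", NextSibling: p}
	body := &node{Data: "body", FirstChild: a}

	_, ok := findFirst(body, "p")
	fmt.Println("found p:", ok)
}
```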

how can I get the nested element?

I want to get the text of an element that contains other elements, like this:
html:

<div id="view">
hello
<p>hello</p>
</div>

go:

doc := soup.HTMLParse(html)
text := doc.Find("div", "id", "view").Text()
fmt.Println(text)

In this sample, it only outputs "hello". I want it to output "hellohello". How can I do that?
Thanks for having a look.
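
Collecting the text of an element together with its descendants amounts to a recursive walk over text nodes. The following is a minimal sketch on a stand-in `node` type (hypothetical, mirroring a few `html.Node` fields); `fullText` is an illustrative helper, and trimming the whitespace of each text node is an assumption made here so that the result matches the "hellohello" asked for above.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeType and node are hypothetical stand-ins for html.NodeType and html.Node.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

type node struct {
	Type        nodeType
	Data        string
	FirstChild  *node
	NextSibling *node
}

// fullText concatenates every text node under n, in document order,
// trimming the surrounding whitespace of each one.
func fullText(n *node) string {
	var sb strings.Builder
	var walk func(*node)
	walk = func(cur *node) {
		if cur == nil {
			return
		}
		if cur.Type == textNode {
			sb.WriteString(strings.TrimSpace(cur.Data))
		}
		for c := cur.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(n)
	return sb.String()
}

func main() {
	// Models <div id="view">hello<p>hello</p></div>.
	inner := &node{Type: textNode, Data: "hello"}
	p := &node{Type: elementNode, Data: "p", FirstChild: inner}
	outer := &node{Type: textNode, Data: "\nhello\n", NextSibling: p}
	div := &node{Type: elementNode, Data: "div", FirstChild: outer}

	fmt.Println(fullText(div))
}
```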

All the functions should return errors

Instead of using defer fetch.CatchPanic("Find()"), the functions should return errors, especially when no data has been found.

&nbsp; causes no text to be returned

An odd issue I'm having while using soup to parse Fmylife's site for FMLs occurs when I get an FML that contains the &nbsp; entity:

<p class="block">
<a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
<span class="icon-piment"></span>&nbsp;
[Insert FML text here] FML
</a>
</p>

When I call Text() on it, it returns blank text and nothing else.

I usually call it using .Find("p", "class", "block").Find("a").Text(), and if the markup doesn't contain the &nbsp; entity, it returns the text fine.

FindAll element with no class attribute

In python we can use:

soup.findAll(attrs={'class': None})

Element not found in the weather.go example

Hello, please help me.
Code:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/anaskhan96/soup"
)

func main() {
	fmt.Printf("Enter the name of the city : ")
	city, _ := bufio.NewReader(os.Stdin).ReadString('\n')
	city = city[:len(city)-1]
	cityInURL := strings.Join(strings.Split(city, " "), "+")
	url := "https://www.bing.com/search?q=weather+" + cityInURL
	resp, err := soup.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	doc := soup.HTMLParse(resp)
	fmt.Println(doc)
	grid := doc.Find("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
	fmt.Println("grid = ", grid)
	heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()

GOROOT=C:\Go #gosetup
GOPATH=C:\Users\User\go\src\soup;C:\Users\User\go #gosetup
C:\Go\bin\go.exe build -o C:\Users\User\AppData\Local\Temp___go_build_weather_go.exe C:\Users\User\go\src\soup\weather.go #gosetup
C:\Users\User\AppData\Local\Temp___go_build_weather_go.exe #gosetup
Enter the name of the city : moscow
{0xc0002600e0 html }
grid = { element div with attributes class b_antiTopBleed b_antiSideBleed b_antiBottomBleed not found}
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x6628dc]

goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x0, 0xc000471e70, 0x3, 0x3, 0xc000190000, 0xc000471b58, 0x468093)
C:/Users/User/go/src/soup/src/github.com/anaskhan96/soup/soup.go:392 +0x31c
github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x760a00, 0xc000153820, 0xc000471e70, 0x3, 0x3, 0x0, 0x0, ...)
C:/Users/User/go/src/soup/src/github.com/anaskhan96/soup/soup.go:167 +0x94
main.main()
C:/Users/User/go/src/soup/weather.go:27 +0x5ef

Process finished with exit code 2

How to get http status code?

I want to get the status code of the HTTP response to make sure the website returned 200 OK before continuing.
Sometimes the website returns 404.

Any way to get the HTTP status code?
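
soup.Get, as used in the snippets in this list, returns only the body and an error, so one option is to make the request with net/http directly and inspect the status before handing the body to the parser. The sketch below assumes that approach; `fetch`, `stubClient` and `roundTripFunc` are illustrative names, and the stub transport exists only so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc lets a plain function act as an http.RoundTripper, so the
// example runs without touching the network.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// fetch returns the body only for a 200 response; other statuses become errors.
func fetch(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: unexpected status %d", url, resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// stubClient answers every request with the given status and body.
func stubClient(status int, body string) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: status,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader(body)),
			Request:    r,
		}, nil
	})}
}

func main() {
	// example.invalid is a placeholder host; the stub never resolves it.
	if _, err := fetch(stubClient(http.StatusNotFound, ""), "http://example.invalid/"); err != nil {
		fmt.Println("error:", err)
	}
}
```

With a real `http.Client` in place of the stub, the returned string could then be passed to soup.HTMLParse as in the other examples.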

Anything akin to BeautifulSoup's Comment?

Just curious if soup has anything similar to how BeautifulSoup lets you parse HTML comments in Python?

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings

Trying to parse some HTML where some data is commented, and able to do the following in Python:

from bs4 import BeautifulSoup, Comment
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
comments_soup = BeautifulSoup(comment, 'lxml')

Is there anything close to that here? Or any chance of adding something like it?

Can anyone tell me how to send a request through a proxy?

I need to access a specific website through a proxy port on my local machine.
Thanks
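
With the standard library, a proxy is configured on the transport of an `http.Client`. The sketch below shows that part only; `newProxyClient` and `proxyFor` are illustrative helpers, and the address 127.0.0.1:8080 is a placeholder for whatever proxy is running locally. A stack trace further down this list shows a `soup.GetWithClient` function; assuming it accepts a `*http.Client`, a client built this way could be passed to it.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient builds an http.Client that routes requests through proxyAddr.
func newProxyClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   30 * time.Second,
	}, nil
}

// proxyFor reports which proxy the client's transport would pick for target.
// It only inspects the configuration and sends nothing over the network.
func proxyFor(client *http.Client, target string) (string, error) {
	tr, ok := client.Transport.(*http.Transport)
	if !ok || tr.Proxy == nil {
		return "", errors.New("client has no proxy configured")
	}
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	u, err := tr.Proxy(req)
	if err != nil {
		return "", err
	}
	if u == nil {
		return "", nil
	}
	return u.String(), nil
}

func main() {
	// 127.0.0.1:8080 is a placeholder for a locally running proxy.
	client, err := newProxyClient("http://127.0.0.1:8080")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	proxy, err := proxyFor(client, "https://example.com/")
	fmt.Println(proxy, err)
}
```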

i found a bug in your soup

Find by single class

Currently Find("a", "class", "message") would only work if it was <a class="message"></a> but would not work on <a class="message input-message"></a> even though they are both of class message.

Could this be added?
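
The behavior requested here comes down to treating the class attribute as a space-separated list of tokens instead of one string. A minimal sketch of that comparison follows; `hasClass` is an illustrative helper, not a function from soup.

```go
package main

import (
	"fmt"
	"strings"
)

// hasClass reports whether the space-separated class attribute contains
// class as a whole token.
func hasClass(classAttr, class string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == class {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasClass("message input-message", "message"))
	fmt.Println(hasClass("input-message", "message"))
}
```

Comparing whole tokens avoids the false positive a plain substring check would give for `input-message`.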

Crash accessing results of FindAll("span")

package main

import (
        "fmt"
        "github.com/anaskhan96/soup"
        "os"
)

func main() {
        resp, err := soup.Get("https://slashdot.com")
        if err != nil {
                os.Exit(1)
        }
        doc := soup.HTMLParse(resp)
        spans := doc.FindAll("span")
        for _, span := range spans {
                fmt.Println(span.Text())
        }
}

Result:

$ ./test-span
Slashdot
Stories
Polls
Deals
 Login
 Sign up
RSS
Facebook
Google+
Twitter
Newsletter
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x61aad7]

goroutine 1 [running]:
github.com/anaskhan96/soup.Root.Text(0xc420439030, 0x6bbff4, 0x4, 0x0, 0x0, 0x0, 0xa)
	go/src/github.com/anaskhan96/soup/soup.go:257 +0xa7
main.main()
	test-span.go:17 +0x1dd

Remove global variable in find.go

nodeLinks is a global variable in find.go that is reinitialised to a new slice of capacity 10 every time FindAll() in soup.go is called. This causes problems when FindAll() is called concurrently in the driver program: nodeLinks keeps getting reset before either call has finished collecting its nodes.

Is this possible to search by some attribute?

Let us assume we have HTML like this:

<body>
<div class="container">
    <div this-attr="don't care about it's value at all">
</div>
</body>
</html>

And if we search like this:
doc.Find("div", "this-attr")
it yields an error (which I think is expected).

Function findOnce accesses the second argument 🤔

	if uni == true {
		if n.Type == html.ElementNode && matchElementName(n, args[0]) {
			if len(args) > 1 && len(args) < 4 {
				for i := 0; i < len(n.Attr); i++ {
					attr := n.Attr[i]
					searchAttrName := args[1]
					searchAttrVal := args[2]
					if (strict && attributeAndValueEquals(attr, searchAttrName, searchAttrVal)) ||
						(!strict && attributeContainsValue(attr, searchAttrName, searchAttrVal)) {
						return n, true
					}
				}
			} else if len(args) == 1 {
				return n, true
			}
		}
	}
	uni = true
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		p, q := findOnce(c, args, true, strict)
		if q != false {
			return p, q
		}
	}
	return nil, false
}

So my question is whether it's possible or not. Thank you!
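
In the excerpt above, args[2] is read whenever more than one argument is passed, so a two-argument call such as Find("div", "this-attr") would appear to index past the end of args. A sketch of matching that also accepts a bare attribute name is shown below; `attr` is a hypothetical stand-in for `html.Attribute`, and `matchAttrs` takes only the arguments that follow the tag name.

```go
package main

import "fmt"

// attr is a hypothetical stand-in for html.Attribute.
type attr struct{ Key, Val string }

// matchAttrs handles the query arguments that follow the tag name:
// none matches anything, one is an attribute name, two are a name and a value.
func matchAttrs(attrs []attr, args ...string) bool {
	if len(args) == 0 {
		return true
	}
	for _, a := range attrs {
		if a.Key != args[0] {
			continue
		}
		// With only a name given, the attribute's presence is enough.
		if len(args) == 1 || a.Val == args[1] {
			return true
		}
	}
	return false
}

func main() {
	attrs := []attr{{Key: "this-attr", Val: "don't care about its value at all"}}
	fmt.Println(matchAttrs(attrs, "this-attr"))
	fmt.Println(matchAttrs(attrs, "this-attr", "something else"))
}
```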

InnerHTML

Is there any way to get the equivalent of .HTML(), but excluding the element's own markup (just like JS's .innerHTML), without having to resort to regex?

An example:

element.HTML() yields <a href="square-cover-art.jpeg">My wacky label with bold and <code>code</code> and stuff “hmmm”</a>

I want to get My wacky label with bold and <code>code</code> and stuff “hmmm”

I guess I could iterate over element.Children() and concatenate each child's .HTML(), but I think having a .InnerHTML() would make things nicer (and a tad better when it comes to performance I guess)

I'm willing to make a PR :)

Should Text() return all sibling text?

For example:

<div align="center">
<a href="search_3.asp?action=up">up</a>
&nbsp;
<a href="search_3.asp?action=down">down</a>
(2021-9-20~2021-9-26）
</div>

Currently, div.Text() only returns &nbsp;. Would it be better if it also returned (2021-9-20~2021-9-26)?

Navigating to Parent

For proper selection, it would be awesome to be able to navigate to the current element's parent and then continue through its siblings and so on.

Right now it is quite hard to properly find what I am looking for from a strict top-down view.

Please convert output according to charset specified in html

Hi!

If html is not in UTF-8, e.g.:

<meta http-equiv="content-type" content="text/html; charset=windows-1251">

please convert it to UTF-8.

Thanks

Hello~ I'm a student at KwangWoon University in Seoul, Korea

Hello! I'm a Korean university student. If you don't mind, I would like to contribute to 'soup'. I'm not a professional programmer, so I can't contribute difficult code...

But I can contribute more examples of soup to your repository!
Or I can translate some of the guidelines.

If that is okay with you, I can work on it for a month and then send you a pull request.
Is it okay for me to write examples for 'soup' or translate the guidelines or other parts of your repository?

Please reply to me!! :) I will work very hard!!

I'm not very skilled at English, so I'm sorry if this is hard to read.

Sincerely, lionking6792

go.mod: no matching versions for query "v1.2"

It seems go mod/get is unable to understand shorter version strings

I forked your repo and changed it to 1.2.1 and go get did work again.
Currently I'm only able to get v1.1.1

Could you release a new version with 3 digits?
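
Go modules select releases from tags written as full semantic versions with a leading "v" (vMAJOR.MINOR.PATCH), which is consistent with what this issue observes for a two-part tag. As a small sketch, the check below tests whether a tag has that shape; `isCanonicalTag` is an illustrative helper, and pre-release and build suffixes are deliberately left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// canonicalTag matches release tags of the form vMAJOR.MINOR.PATCH.
// Pre-release and build suffixes are left out to keep the sketch short.
var canonicalTag = regexp.MustCompile(`^v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$`)

// isCanonicalTag reports whether tag is a plain three-part version tag.
func isCanonicalTag(tag string) bool {
	return canonicalTag.MatchString(tag)
}

func main() {
	for _, tag := range []string{"v1.2", "1.2.1", "v1.2.0"} {
		fmt.Printf("%-7s %v\n", tag, isCanonicalTag(tag))
	}
}
```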

catchPanic() prints to stdout on some semi-valid cases

Libraries should not catch a panic and then print to stdout.

If I have the following code

th := row.Find("th")
if th.Error == nil && th.Text() == "Service Expiry Date" {
    ...
}

and the call to row.Find("th") returns a structure whose Pointer has a nil FirstChild, then soup will panic, catch the panic, and print to stdout.

By catching the panic, it makes it super hard to figure out what is causing the panic because all I got when running my program was this:

2017/12/18 11:17:56 Error occurred in Text() : runtime error: invalid memory address or nil pointer dereference

however if I comment out the defer catchPanic("Text()") call, then I get a much more helpful error:

DEBUG: DEBUG: &{0xc0422408c0 <nil> <nil> 0xc042240930 0xc042240a10 3 th th  [{ align right} { width 200}]}
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x28 pc=0x688cdb]

goroutine 1 [running]:
github.com/anaskhan96/soup.Root.Text(0xc0422409a0, 0x7a7f94, 0x2, 0x0, 0x0, 0xc0420123f0, 0x63)
        C:/Users/chrome/Development/go/src/github.com/anaskhan96/soup/soup.go:219 +0xbb
main.fetchDate(0xc042052400, 0x1f)
        C:/Users/chrome/Development/go/src/gitlab.corp.xxx.com/se/shcheck/cmd/shcheck/main.go:96 +0x8e6
main.main()
        C:/Users/chrome/Development/go/src/gitlab.corp.xxx.com/se/shcheck/cmd/shcheck/main.go:112 +0x23d

Then, if I look at soup.go line 219, I see

	k := r.Pointer.FirstChild
checkNode:
	if k.Type != html.TextNode {

and it is now clear to me that r.Pointer.FirstChild is nil, and I need to check that it is not nil before calling Text().

However, you should be checking that value in your library, and returning string, error, in my opinion.
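
The string, error signature suggested above could look roughly like the sketch below. It uses a stand-in `node` type (hypothetical, mirroring a few `html.Node` fields), and `firstText`, `errNilNode` and `errNoText` are illustrative names rather than anything from soup.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeType and node are hypothetical stand-ins for html.NodeType and html.Node.
type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

type node struct {
	Type        nodeType
	Data        string
	FirstChild  *node
	NextSibling *node
}

// Illustrative sentinels, not part of soup.
var (
	errNilNode = errors.New("nil node")
	errNoText  = errors.New("no text node found")
)

// firstText returns the first text child of n. A nil node or an element
// without children produces an error rather than a nil dereference.
func firstText(n *node) (string, error) {
	if n == nil {
		return "", errNilNode
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if c.Type == textNode {
			return c.Data, nil
		}
	}
	return "", errNoText
}

func main() {
	empty := &node{Type: elementNode, Data: "th"}
	if _, err := firstText(empty); err != nil {
		fmt.Println("error:", err)
	}

	label := &node{Type: textNode, Data: "Service Expiry Date"}
	th := &node{Type: elementNode, Data: "th", FirstChild: label}
	if text, err := firstText(th); err == nil {
		fmt.Println(text)
	}
}
```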

Empty strings throw errors on Text()

When doing a FindAll("td") and then calling Text() on a result, a nil pointer error is thrown whenever an empty/nil value is encountered in the slice.

“runtime error: invalid memory address or nil pointer dereference
errorString”

An error object should be returned instead, or an empty string.

Find Or

I have a case where I want to match an element that is either a div or a p. I don't know how to do that; it is probably not possible with the existing library, and we would need something like FindOr.

Add a Post() function

Great project, but does it only support HTTP GET? It would be awesome to have POST too.

Crashed with SIGSEGV

I tried to run the weather.go example on my machine and got this:

Enter the name of the city : Brisbane
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x665715]

goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x0, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x70207e, 0x13)
/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:345 +0x315
github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x75c820, 0xc000364040, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x0, ...)
/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:121 +0x82
main.main()
/home/stevek/tmp/go-lang/src/weather.go:24 +0x49d
exit status 2

[BUG]: Search classes with spaces fails every time (even in the weather example you provided)

Hi, I tried your weather example and it always throws an "invalid memory address" error. I tried to reproduce the same bug with another website, and it can actually only search for classes that contain no spaces. I don't know why, but your parser has stopped understanding spaces.
I added a fmt.Println() call to print the only class search that contains spaces (grid). Here is the code:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/anaskhan96/soup"
)

func main() {
	fmt.Printf("Enter the name of the city : ")
	city, _ := bufio.NewReader(os.Stdin).ReadString('\n')
	city = city[:len(city)-1]
	cityInURL := strings.Join(strings.Split(city, " "), "+")
	url := "https://www.bing.com/search?q=weather+" + cityInURL
	resp, err := soup.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	doc := soup.HTMLParse(resp)
	grid := doc.Find("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
	fmt.Println("Print grid:", grid)
	heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
	conditions := grid.Find("div", "class", "wtr_condition")
	primaryCondition := conditions.Find("div")
	secondaryCondition := primaryCondition.FindNextElementSibling()
	temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
	others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
	caption := secondaryCondition.Find("div").Text()
	fmt.Println("City Name : " + heading)
	fmt.Println("Temperature : " + temp + "˚C")
	for _, i := range others {
		fmt.Println(i.Text())
	}
	fmt.Println(caption)
}

And that's the output:

Enter the name of the city : New York
Print grid: {<nil>  element `div` with attributes `class b_antiTopBleed b_antiSideBleed b_antiBottomBleed` not found}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x61d1f5]

goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x0, 0xc42005be68, 0x3, 0x3, 0xc420050000, 0x4aa247, 0xc420261e00)
	/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:304 +0x315
github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x6e1e60, 0xc420242070, 0xc42005be68, 0x3, 0x3, 0x0, 0x0, ...)
	/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:120 +0x8d
main.main()
	/home/fef0/Code/Go/Test/Test.go:26 +0x4e3
exit status 2

As you can see in the second line of the output, the grid could not be found, and this happens only because there are spaces in the class name.
I hope you can fix that as soon as possible, bye for now!

how to get element's parent node

example: get p's NodeValue text

<p>
    <a class="btn" href=""> </a>
    text
</p>

Is there any way to get a's parent p as a Root? I don't see the parent recorded anywhere in the DFS.

Feature Request: User Defined Headers

Loving this library so far :-)

It would be really useful to be able to define our own headers, like user-agent for example.

Then I'd be able to use this for sites that require auth :-)
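
Independently of what soup exposes, custom headers can be set on a request built with net/http. The sketch below shows that pattern; `getWithHeaders`, `echoHeaderClient` and `roundTripFunc` are illustrative names, the User-Agent value is made up, and the stub transport only echoes a header back so the example runs without a network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// getWithHeaders issues a GET with the supplied headers and returns the body.
func getWithHeaders(client *http.Client, url string, headers map[string]string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	for name, value := range headers {
		req.Header.Set(name, value)
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// echoHeaderClient is a stub that replies with the value of one request header.
func echoHeaderClient(name string) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader(r.Header.Get(name))),
			Request:    r,
		}, nil
	})}
}

func main() {
	// example.invalid is a placeholder host; the stub never resolves it.
	headers := map[string]string{"User-Agent": "my-scraper/1.0"}
	body, err := getWithHeaders(echoHeaderClient("User-Agent"), "http://example.invalid/", headers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```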

incomplete docs (PostForm)

https://github.com/anaskhan96/soup/blob/v1.2.4/soup.go#L232

// PostForm is a convenience method for POST requests that

invalid memory address or nil pointer dereference when chaining methods

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/anaskhan96/soup"
)

func main() {
	go func() {
		http.ListenAndServe(":12345", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, "OK")
		}))
	}()

	time.Sleep(time.Second)

	resp, err := soup.Get("http://127.0.0.1:12345/")
	if err != nil {
		log.Println("Error:", err.Error())
		return
	}

	doc := soup.HTMLParse(resp)
	r := doc.Find("Semething").Find("SomethingElse")
	fmt.Println(r.Error)
}

Hello, if I try to chain the Find and FindAll methods on non-existent tags, as in the example above, I get a panic:

$ go run .
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x66ce1b]

goroutine 1 [running]:
github.com/anaskhan96/soup.findOnce(0x6b64c0?, {0xc00011fe50?, 0x1, 0x1}, 0x2?, 0x0)
        /home/alex/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:502 +0xfb
github.com/anaskhan96/soup.Root.Find({0x0, {0x0, 0x0}, {0x766ee0, 0xc000238030}}, {0xc00011fe50?, 0x1, 0x1})
        /home/alex/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:268 +0xa5
main.main()
        /home/alex/test/play3/main.go:24 +0x1ca
exit status 2

I believe that both func findOnce and func findAllofem should be checking if n *html.Node is nil before proceeding with the processing.
Am I understanding this correctly?

Thanks,
Alex
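
One way to make chained lookups safe is for each call to pass an earlier failure through unchanged and to guard against a nil pointer before traversing. The sketch below illustrates the idea with stand-in `node` and `root` types (hypothetical simplifications of `html.Node` and soup's `Root`); for brevity `Find` here searches direct children only, and `errNotFound` is an illustrative sentinel.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for html.Node.
type node struct {
	Data        string
	FirstChild  *node
	NextSibling *node
}

// errNotFound is an illustrative sentinel, not part of soup.
var errNotFound = errors.New("element not found")

// root is a simplified stand-in for soup's Root.
type root struct {
	Pointer *node
	Error   error
}

// Find looks for tag among the direct children. An earlier failure is passed
// through unchanged, so chained calls never touch a nil Pointer.
func (r root) Find(tag string) root {
	if r.Error != nil {
		return r
	}
	if r.Pointer == nil {
		return root{Error: errNotFound}
	}
	for c := r.Pointer.FirstChild; c != nil; c = c.NextSibling {
		if c.Data == tag {
			return root{Pointer: c}
		}
	}
	return root{Error: fmt.Errorf("%q: %w", tag, errNotFound)}
}

func main() {
	doc := root{Pointer: &node{Data: "html"}}
	r := doc.Find("Semething").Find("SomethingElse")
	fmt.Println(r.Error)
	fmt.Println(errors.Is(r.Error, errNotFound))
}
```

The error reported at the end of the chain is the one from the first lookup that failed, which tends to be the most useful one for debugging.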

How to find an element with particular value?

Hi!

I have HTML files that do not use ids. With BeautifulSoup it is easy to find such an element using find("Some text"):

<span style="color: #012345">Some text</span>

Is the only way to find this to use FindAll("span") and then iterate through all the spans found? In that case, how can I check whether a particular span element contains text? I wouldn't like to disable debugging, since, I guess, an empty span is not necessarily a critical error.

How to use selectors?

BeautifulSoup lets you use select with CSS selectors. Is there anything like that in this library?

fatal error: concurrent map iteration and map write

fatal error: concurrent map iteration and map write

goroutine 833 [running]:
runtime.throw(0x7161cb, 0x26)
        /root/.gvm/gos/go1.15.5/src/runtime/panic.go:1116 +0x72 fp=0xc000069938 sp=0xc000069908 pc=0x437312
runtime.mapiternext(0xc000069a10)
        /root/.gvm/gos/go1.15.5/src/runtime/map.go:853 +0x554 fp=0xc0000699b8 sp=0xc000069938 pc=0x412574 
github.com/anaskhan96/soup.setHeadersAndCookies(0xc0002e4600)
        /root/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:145 +0x87 fp=0xc000069b28 sp=0xc0000699b8 pc=0x6831a7
github.com/anaskhan96/soup.GetWithClient(0xc000288660, 0x24, 0xc0004b3da0, 0x0, 0x0, 0x0, 0x0)
        /root/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:117 +0x18b fp=0xc000069be0 sp=0xc000069b28 pc=0x682cab
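
The trace above shows a map being iterated in setHeadersAndCookies while another goroutine writes to it. Assuming the headers live in a map shared between goroutines, one common remedy is to guard it with a mutex and iterate over a copy. The sketch below shows that pattern; `headerStore` and its methods are hypothetical and do not describe how soup stores its headers.

```go
package main

import (
	"fmt"
	"sync"
)

// headerStore is a hypothetical wrapper that guards a header map so it can be
// written and read from several goroutines.
type headerStore struct {
	mu sync.RWMutex
	m  map[string]string
}

func newHeaderStore() *headerStore {
	return &headerStore{m: make(map[string]string)}
}

// Set stores one header under the write lock.
func (s *headerStore) Set(name, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[name] = value
}

// Snapshot returns a copy that callers can range over without holding the lock.
func (s *headerStore) Snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.m))
	for k, v := range s.m {
		out[k] = v
	}
	return out
}

func main() {
	store := newHeaderStore()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			store.Set(fmt.Sprintf("X-Test-%d", i), "1")
			// Iterating the copy cannot race with concurrent Set calls.
			for range store.Snapshot() {
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("headers stored:", len(store.Snapshot()))
}
```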

soup.HTMLParse() returning nil

This method was previously working but for some reason, it returns nil every single time now

//example
t, _ := soup.Get("https://google.com")
fmt.Println(soup.HTMLParse(t)) //prints {address <nil>}

Versioning

Thanks for making an awesome package.

Go Modules can't get the latest version of this package because of its version tag. It can only get 1.0.1, not 1.1.
Can you add a new tag named 1.1.0?

How do I set a proxy for a GET request?

I need a proxy to connect to the website.
