scrapysharp's Introduction

Getting started

ScrapySharp includes a web client able to simulate a real web browser (it handles referrers, cookies, etc.).

HTML parsing should feel as natural as possible, so ScrapySharp favors CSS selectors and LINQ.

This framework wraps HtmlAgilityPack.

Basic examples of CssSelect usages

using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class Example
{
    public static void Main()
    {
        // 'html' stands for an HtmlNode, e.g. the DocumentNode of a loaded HtmlDocument.
        var divs = html.CssSelect("div");                            // all div elements
        var content = html.CssSelect("div.content");                 // all div elements with CSS class 'content'
        var widgets = html.CssSelect("div.widget.monthlist");        // all div elements with both CSS classes
        var paging = html.CssSelect("#postPaging");                  // all HTML elements with the id postPaging
        var pagingTest = html.CssSelect("div#postPaging.testClass"); // all HTML elements with the id postPaging and CSS class testClass

        var paras = html.CssSelect("div.content > p.para");          // p elements that are direct children of div elements with CSS class 'content'

        var logins = html.CssSelect("input[type=text].login");       // text boxes with CSS class login
    }
}

ScrapySharp can also simulate a web browser

ScrapingBrowser browser = new ScrapingBrowser();

//set UseDefaultCookiesParser as false if a website returns invalid cookies format
//browser.UseDefaultCookiesParser = false;

WebPage homePage = browser.NavigateToPage(new Uri("http://www.bing.com/"));

PageWebForm form = homePage.FindFormById("sb_form");
form["q"] = "scrapysharp";
form.Method = HttpVerb.Get;
WebPage resultsPage = form.Submit();

HtmlNode[] resultsLinks = resultsPage.Html.CssSelect("div.sb_tlst h3 a").ToArray();

WebPage blogPage = resultsPage.FindLinks(By.Text("romcyber blog | Just another WordPress site")).Single().Click();

Install ScrapySharp in your project

It's easy to use ScrapySharp in your project: a NuGet package is available on nuget.org and on MyGet.

News

ScrapySharp V3 is a rebirth of the project.

The old version, under the GPL license, is still on Bitbucket.

Version 3 is a conversion to .NET Standard 2.0 and a relicensing.

scrapysharp's People

Contributors

dararish, gregclout, jan-tee, rflechner


scrapysharp's Issues

302 Redirect after Login

I can successfully populate and submit a login form with ScrapySharp. As is common with logins, the server sends back a 302 redirect to a main page, which ScrapySharp treats as an error (a WebException).

I tried enabling AllowAutoRedirect as well as AllowMetaRedirect on the ScrapingBrowser object, with no change in behavior.

Please advise.
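If neither flag helps, a common workaround is to catch the response yourself and follow the Location header manually. Below is a minimal sketch of just the redirect-resolution step, using only stdlib System.Net.Http types; RedirectHelper is a hypothetical helper, not part of ScrapySharp.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper: resolve a 3xx response's Location header against the
// request URI so the target can be navigated to manually after a login POST.
static class RedirectHelper
{
    public static Uri ResolveRedirect(Uri requestUri, HttpResponseMessage response)
    {
        int code = (int)response.StatusCode;
        if (code < 300 || code > 399)
            return null; // not a redirect

        Uri location = response.Headers.Location;
        if (location == null)
            return null; // malformed redirect: no Location header

        // Location may be relative ("/home"); resolve it against the request URI.
        return location.IsAbsoluteUri ? location : new Uri(requestUri, location);
    }
}
```

The resolved Uri can then be passed back to the browser to continue the session.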

Cookie error on web browser simulation example

The basic web browser simulation example provided in the readme throws two System.AggregateException inner exceptions: 1) CookieException: An error occurred when parsing the Cookie header for Uri 'http://www.bing.com/'; and 2) CookieException: The 'Path'='/search' part of the cookie is invalid.

For reference, here's the code in its entirety:

public static void Test()
{
	ScrapingBrowser browser = new ScrapingBrowser();

	//set UseDefaultCookiesParser as false if a website returns invalid cookies format
	//browser.UseDefaultCookiesParser = false;

	WebPage homePage = browser.NavigateToPage(new Uri("http://www.bing.com/"));

	PageWebForm form = homePage.FindFormById("sb_form");
	form["q"] = "scrapysharp";
	form.Method = HttpVerb.Get;
	WebPage resultsPage = form.Submit();

	HtmlNode[] resultsLinks = resultsPage.Html.CssSelect("div.sb_tlst h3 a").ToArray();

	WebPage blogPage = resultsPage.FindLinks(By.Text("romcyber blog | Just another WordPress site")).Single().Click();
}

How would one download a video with ScrapySharp?

Hi!
I have a very specific case, where I want to download a video from streamtape:

  • on the page there is a download button ("downloadvideo") that, when clicked, runs a countdown for a few seconds and then changes its href to the real download link

          var browser = new ScrapingBrowser();
          WebPage homePage = browser.NavigateToPage(new Uri("https://streamtape.com/v/9jzWrAdrJotDMD"));
          var firstDownloadButton = homePage.FindLinks(By.Id("downloadvideo"))
              .Single().Click();
    

I don't know what to do next.

Unable to Scrape Dynamic Content

I'm trying to use the ScrapingBrowser class to navigate and scrape a site. I noticed I was unable to enter the username and password on the login page because it is dynamically generated from Javascript. Is there a way to execute scripts on the page so I can get the full web page content?

NavigateToPage hangs

When I call NavigateToPage on https://portal.ryder.com, the call hangs. I have tried it both with AutoRedirect on and off using code that works on another site.

In Fiddler, I get from "/" to "/redirect" then to a second "/redirect" which then hangs.

Btw, thanks for the great code.

Issue parsing empty select

If there is a select input without options (one populated later by script), the form parser throws an exception, in PageWebForm.cs, ParseFormFields. Here I've put some null checks on value:

var selects = from @select in node.CssSelect("select")
              let name = @select.GetAttributeValue("name")
              let option =
                  @select.CssSelect("option").FirstOrDefault(o => o.Attributes["selected"] != null) ??
                  @select.CssSelect("option").FirstOrDefault()
              let value = (option == null) ? null : option.GetAttributeValue("value")
              select new FormField
              {
                  Name = name,
                  Value = string.IsNullOrEmpty(value) ? (option == null ? "" : option.InnerText) : value
              };

Extend to allow for node lookups for multiple selectors

At present, if I have the following HTML:

<html>
    <body>
        <p id="description">This is a description</p>
        <p id="secondary-description">This is a secondary description</p>
        <p id="not-a-description">This is a secondary description</p>        
    </body>
</html>

If you want to get description and secondary-description, you need two calls:

    var first = html.CssSelect("#description");
    var second = html.CssSelect("#secondary-description");

Ideally you could use a single call like:

    var nodes = html.CssSelect(new string[] { "#description", "#secondary-description" });
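Until such an overload exists, the two result sets can be combined with LINQ, e.g. html.CssSelect("#description").Concat(html.CssSelect("#secondary-description")). The idea can be sketched with stdlib XML parsing alone, since the sample markup above happens to be well-formed (real pages need HtmlAgilityPack; MultiSelect is a hypothetical helper):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Illustrative only: XDocument stands in for HtmlAgilityPack, because the
// sample markup is well-formed XML. The technique is the same: one pass over
// the nodes, keeping those whose id is in the requested set.
static class MultiSelect
{
    public static List<string> SelectByIds(string markup, params string[] ids)
    {
        var doc = XDocument.Parse(markup);
        var wanted = new HashSet<string>(ids);
        return doc.Descendants()
                  .Where(e => wanted.Contains((string)e.Attribute("id")))
                  .Select(e => e.Value)
                  .ToList();
    }
}
```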

Downloading resource files

How do you use ScrapingBrowser to download web resource files? I would love this feature but don't see how to save to disk.

For example, I am able to achieve this by adding to ScrapingBrowser:

    public WebResource DownloadWebResourceFile(Uri url, string path, FileMode mode)
    {
        var response = ExecuteRequest(url, HttpVerb.Get, new NameValueCollection());
        var stream = new FileStream(path, mode);
        var responseStream = response.GetResponseStream();

        if (responseStream != null)
        {
            responseStream.CopyTo(stream);
            responseStream.Close();
        }

        return new WebResource(stream, response.Headers["Last-Modified"], url, !IsCached(response.Headers["Cache-Control"]), response.ContentType);
    }

Thank you.
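One hazard in code like the snippet above is closing the response stream outside a null check and never disposing the destination stream on failure. The copy step can be hardened with using blocks, sketched here against in-memory streams (CopyResponse is a hypothetical helper, not ScrapySharp API):

```csharp
using System;
using System.IO;

// Sketch of a safer copy step: the source stream is disposed by the using
// block even if CopyTo throws. MemoryStream stands in for the HTTP response
// stream and the destination file.
static class ResourceDownload
{
    public static void CopyResponse(Stream responseStream, Stream destination)
    {
        if (responseStream == null)
            return; // nothing to copy

        using (responseStream)
        {
            responseStream.CopyTo(destination);
        }
    }
}
```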

Working Timeout with Proxy

I am scraping a site that requires the use of hundreds or thousands of proxies to make it through. I have a list of 20,000 proxies, but many are dead. When I encounter a dead proxy, I mark it and retry with a different one, but often the timeout for a dead proxy is around 15 seconds. It would be nice if setting browser.Timeout handled that, especially on async calls.
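Whether browser.Timeout governs async navigation is the open question here; as a client-side workaround, any async call can be raced against Task.Delay to enforce a hard cap. A sketch (TimeoutHelper is not ScrapySharp API):

```csharp
using System;
using System.Threading.Tasks;

// Race a task against Task.Delay so a dead proxy can be abandoned after a
// hard cap instead of waiting out the full socket timeout.
static class TimeoutHelper
{
    public static async Task<T> WithTimeout<T>(Task<T> task, TimeSpan timeout)
    {
        var finished = await Task.WhenAny(task, Task.Delay(timeout));
        if (finished != task)
            throw new TimeoutException("operation exceeded " + timeout);
        return await task; // propagate the result or the original exception
    }
}
```

For example, wrap the navigation call (assumed async here) and, on TimeoutException, mark the proxy dead and rotate to the next one.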

FSharp does not support .net native UWP

Hello, it turns out that to publish a UWP app for the Windows Store I need to compile with .NET Native, but the only package that gives me problems is ScrapySharp: Error GK0025 In assembly 'C:\Users\humbe\source\repos\GraphPriceOne-key\GraphPriceOne\bin\x64\Release\ScrapySharp.Core.dll': FSharp support is not yet implemented but 'FSharp.Core' is using it.

Could that 9% be ported to C#?

Could not install package from nuGet

Here is the error I get when trying to install the package from NuGet or MyGet:

Install-Package : Could not install package 'ScrapySharp 3.0.0'. You are trying to install this package into a project that targets '.NETFramework,Version=v2.0', but the package does not contain any assembly references or content files that are 
compatible with that framework. For more information, contact the package author.
At line:1 char:1
+ Install-Package ScrapySharp -Version 3.0.0 -Source https://www.myget. ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Install-Package], Exception
    + FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PackageManagement.PowerShellCmdlets.InstallPackageCommand

ScrapySharp.Network.Webpage Class

Hi,

I am new to ScrapySharp and have a couple of questions that I hope to get answered promptly:

  1. What is the use of public void SaveSnapshot(string path)? An example will be of great help.
  2. How can I take the snapshot of the whole web page?

--
Regards,
Omer Rasheed

Error while dockerize ScrapySharp

Hello, I have a problem when I try to dockerize an application that uses ScrapySharp: in Docker it throws the following error on the line where the Click function is called on a hyperlink: "Unable to cast object of type 'System.Net.FileWebRequest' to type 'System.Net.HttpWebRequest'".

suggest: support selectors like `:scope > a`

environment

  • netcoreapp2.2
  • dotnet add package ScrapySharp --version 3.0.0

details

invalid selector :scope > a

example

<div id="node">
some text <a>link</a> 
<span>span</span> some text <br/>
<p><a>link</a> </p>
</div>

I want to select something like `& > a` or `#node > a`, but using (HtmlNode node).CssSelect("selector"):

HtmlNode node = ....
node.CssSelect(":scope > a")

The same goes for `.tit > a:nth-child(2)`; I can use LINQ instead, like

var link = tit.ChildNodes.Where((e, idx) => e.Name == "a" && idx == 1).FirstOrDefault();

but when a selector finds no element, it is easy to throw an exception that includes the selector, whereas with LINQ I need to add extra messages myself.

Thanks.

ScrapySharp is a simple and powerful tool; this is a feature recommendation. Thank you to all the participants for their great work.
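The direct-child lookup plus the wished-for selector-bearing exception can be sketched with LINQ. XDocument stands in for HtmlAgilityPack here (with HtmlNode, filtering ChildNodes by name plays the same role); ScopeSelect is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Sketch of a ":scope > a" lookup (direct children only) that throws a
// message containing the selector when nothing matches.
static class ScopeSelect
{
    public static List<XElement> DirectChildren(XElement node, string name)
    {
        var hits = node.Elements(name).ToList(); // Elements() = direct children only
        if (hits.Count == 0)
            throw new InvalidOperationException(
                "no match for selector ':scope > " + name + "'");
        return hits;
    }
}
```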

Body data is not being sent when using PUT verb (Scrapingbrowser)

I'm trying to send some data in the request body using the PUT verb, but it is not working; nothing is sent.

Does anyone know how to solve that?

This is my code:

data = "{\"assignee\":\"" + USER_NAME + "\"}";
browser.NavigateToPage(new Uri(URL), HttpVerb.Put, data);

When inspecting the request in Fiddler, I can see nothing is being passed in the body.
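Until the cause is found, one workaround is to bypass ScrapingBrowser and build the PUT with stdlib HttpClient types. The sketch below only constructs the request (no network call); the user name is passed as a parameter in place of the USER_NAME constant from the report:

```csharp
using System;
using System.Net.Http;
using System.Text;

// Hypothetical workaround: build a PUT request whose JSON body is guaranteed
// to be attached, ready to be sent with HttpClient.SendAsync.
static class PutRequest
{
    public static HttpRequestMessage Build(string url, string userName)
    {
        var json = "{\"assignee\":\"" + userName + "\"}";
        return new HttpRequestMessage(HttpMethod.Put, url)
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
    }
}
```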


Cookie support and logging

I'm trying to navigate through a website that requires login. I have the idea that maybe cookies aren't stored/submitted on subsequent GET/POST operations. Unfortunately there is no easy way to see what is being submitted.

It would be helpful if this information were logged in some way, or if it were possible to read these structures.

Some css selector not supported

Unless I'm failing somewhere else, some selectors throw an exception: Invalid css selector syntax. The two that aren't working so far are "E + F" and "E:not(selector)".

Is there a list of supported selectors?
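Until "E + F" is supported, adjacent-sibling matching can be emulated with LINQ. A sketch over stdlib XDocument (with HtmlAgilityPack, sibling navigation plays the same role; SiblingSelect is a hypothetical helper):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// CSS "E + F": all F elements whose immediately preceding element sibling
// is an E. ElementsAfterSelf() skips text nodes, matching CSS semantics.
static class SiblingSelect
{
    public static List<XElement> AdjacentSiblings(XElement root, string e, string f)
    {
        return root.Descendants(e)
                   .Select(x => x.ElementsAfterSelf().FirstOrDefault())
                   .Where(n => n != null && n.Name.LocalName == f)
                   .ToList();
    }
}
```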

Update HTMLAgilityPack Dependency version

Currently this project only supports HTMLAgilityPack up to 1.7.4, but the latest version is many iterations past that, at 1.11.7.

Is this project still maintained, or is there a reason the latest dependency is incompatible with ScrapySharp?

Auto Detect Encoding can not work / Inaccurate results

Description

Auto-detect encoding does not work. The target web page's encoding is gb2312 (code page 936), and the detection results are inaccurate.

Test code with inaccurate results (AutoDetectCharsetEncoding):

var scrapingBrowser_demo001 = new ScrapySharp.Network.ScrapingBrowser()
{
    AutoDetectCharsetEncoding = true
};
ScrapySharp.Network.WebPage homePage_demo001 =
    scrapingBrowser_demo001.NavigateToPage(
        new Uri("https://news.163.com/19/0717/11/EK9KFG4A0001885B.html"));

Thanks very much.
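When AutoDetectCharsetEncoding guesses wrong, a manual fallback is to sniff the charset from the page's meta tags and decode the raw bytes yourself. A sketch of the sniffing step (CharsetSniffer is a hypothetical helper; note that legacy code pages like gb2312 also require CodePagesEncodingProvider on .NET Core):

```csharp
using System;
using System.Text.RegularExpressions;

// Pull a charset name out of the page head, covering both
// <meta charset="..."> and the older http-equiv content="...; charset=..." form.
static class CharsetSniffer
{
    public static string FromMeta(string htmlHead)
    {
        var m = Regex.Match(htmlHead, "charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)",
                            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.ToLowerInvariant() : null;
    }
}
```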

now all fixed

Can I get this in standard .NET, or can I install the NuGet package? I have some problems with referencing it in Xamarin.

Issue with cookie path

I am trying to scrape a server that returns multiple Set-Cookie headers, and the alternative parser does not work correctly.
The default one works but throws an invalid-cookie-path error because one cookie carries a path.

The Uri construction here omits the path and supplies the domain only, for example https://example.com:443/; but if the cookie has a path, for example /forum, cookieContainer.SetCookies throws an exception about an invalid path. Just passing the original url here fixed it for me. I'm not sure why that url construction was needed.

ScrapingBrowser.cs
private async Task GetWebResponseAsync(Uri url, HttpWebRequest request)

var cookieUrl = new Uri(string.Format("{0}://{1}:{2}/",
    response.ResponseUri.Scheme, response.ResponseUri.Host, response.ResponseUri.Port));

if (UseDefaultCookiesParser)
    cookieContainer.SetCookies(url, cookiesExpression);
else
    SetCookies(url, cookiesExpression);

...

Regards,
Bojo
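The fix described above (passing the original request URL, path included, to SetCookies) can be demonstrated with the stdlib CookieContainer alone:

```csharp
using System;
using System.Net;

// With the original request URL (path included), a path-scoped cookie like
// Path=/forum is accepted and returned for matching requests; a bare
// scheme://host:port/ reconstruction is what triggered the invalid-path error.
static class CookieDemo
{
    public static int StoreAndCount()
    {
        var container = new CookieContainer();
        var requestUrl = new Uri("https://example.com/forum/index.php");

        container.SetCookies(requestUrl, "session=abc123; Path=/forum");

        // The cookie comes back for URLs under /forum.
        return container.GetCookies(new Uri("https://example.com/forum/topic/1")).Count;
    }
}
```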
