scrapysharp's Introduction

Getting started

ScrapySharp includes a web client able to simulate a real web browser (it handles referrers, cookies, etc.).

HTML parsing should feel as natural as possible, so ScrapySharp favors CSS selectors and LINQ.

This framework wraps HtmlAgilityPack.

Basic examples of CssSelect usages

using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class Example
{
    public static void Main()
    {
        // 'html' stands for an HtmlNode, e.g. the DocumentNode of a loaded HtmlDocument.
        var divs = html.CssSelect("div");                            // all div elements
        var content = html.CssSelect("div.content");                 // all div elements with CSS class 'content'
        var widgets = html.CssSelect("div.widget.monthlist");        // all div elements with both CSS classes
        var paging = html.CssSelect("#postPaging");                  // all HTML elements with the id postPaging
        var pagingTest = html.CssSelect("div#postPaging.testClass"); // all HTML elements with the id postPaging and CSS class testClass

        var paras = html.CssSelect("div.content > p.para");          // p elements that are direct children of div elements with CSS class 'content'

        var logins = html.CssSelect("input[type=text].login");       // text boxes with CSS class login
    }
}

ScrapySharp can also simulate a web browser

ScrapingBrowser browser = new ScrapingBrowser();

//set UseDefaultCookiesParser as false if a website returns invalid cookies format
//browser.UseDefaultCookiesParser = false;

WebPage homePage = browser.NavigateToPage(new Uri("http://www.bing.com/"));

PageWebForm form = homePage.FindFormById("sb_form");
form["q"] = "scrapysharp";
form.Method = HttpVerb.Get;
WebPage resultsPage = form.Submit();

HtmlNode[] resultsLinks = resultsPage.Html.CssSelect("div.sb_tlst h3 a").ToArray();

WebPage blogPage = resultsPage.FindLinks(By.Text("romcyber blog | Just another WordPress site")).Single().Click();

Install ScrapySharp in your project

It's easy to use ScrapySharp in your project: a NuGet package is available on nuget.org and on MyGet.

News

ScrapySharp V3 is a rebirth of the project.

The old version, under the GPL license, is still on Bitbucket.

Version 3 is a conversion to .NET Standard 2.0 and a relicensing.

scrapysharp's People

Contributors

dararish, gregclout, jan-tee, rflechner


scrapysharp's Issues

302 Redirect after Login

I can successfully populate and submit a login form with ScrapySharp. As is common with logins, the server sends back a 302 redirect to a main page, which ScrapySharp treats as an error (a WebException).

I tried enabling AllowAutoRedirect as well as AllowMetaRedirect on the ScrapingBrowser object, with no change in behavior.

Please advise.
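If neither flag helps, a common workaround is to catch the response yourself and follow the Location header manually. Below is a minimal sketch of just the redirect-resolution step, using only stdlib System.Net.Http types; RedirectHelper is a hypothetical helper, not part of ScrapySharp.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper: resolve a 3xx response's Location header against the
// request URI so the target can be navigated to manually after a login POST.
static class RedirectHelper
{
    public static Uri ResolveRedirect(Uri requestUri, HttpResponseMessage response)
    {
        int code = (int)response.StatusCode;
        if (code < 300 || code > 399)
            return null; // not a redirect

        Uri location = response.Headers.Location;
        if (location == null)
            return null; // malformed redirect: no Location header

        // Location may be relative ("/home"); resolve it against the request URI.
        return location.IsAbsoluteUri ? location : new Uri(requestUri, location);
    }
}
```

The resolved Uri can then be passed back to the browser to continue the session.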

Cookie error on web browser simulation example

The basic web browser simulation example provided in the readme throws two System.AggregateException inner exceptions: 1) CookieException: An error occurred when parsing the Cookie header for Uri 'http://www.bing.com/'; and 2) CookieException: The 'Path'='/search' part of the cookie is invalid.

For reference, here's the code in its entirety:

public static void Test()
{
	ScrapingBrowser browser = new ScrapingBrowser();

	//set UseDefaultCookiesParser as false if a website returns invalid cookies format
	//browser.UseDefaultCookiesParser = false;

	WebPage homePage = browser.NavigateToPage(new Uri("http://www.bing.com/"));

	PageWebForm form = homePage.FindFormById("sb_form");
	form["q"] = "scrapysharp";
	form.Method = HttpVerb.Get;
	WebPage resultsPage = form.Submit();

	HtmlNode[] resultsLinks = resultsPage.Html.CssSelect("div.sb_tlst h3 a").ToArray();

	WebPage blogPage = resultsPage.FindLinks(By.Text("romcyber blog | Just another WordPress site")).Single().Click();
}

How would one download a video with ScrapySharp?

Hi!
I have a very specific case, where I want to download a video from streamtape:

  • on the page there is a download button ("downloadvideo") that, when clicked, runs a countdown for a few seconds and then changes its href to the real download link

          var browser = new ScrapingBrowser();
          WebPage homePage = browser.NavigateToPage(new Uri("https://streamtape.com/v/9jzWrAdrJotDMD"));
          var firstDownloadButton = homePage.FindLinks(By.Id("downloadvideo"))
              .Single().Click();
    

I don't know what to do next.

Unable to Scrape Dynamic Content

I'm trying to use the ScrapingBrowser class to navigate and scrape a site. I noticed I was unable to enter the username and password on the login page because it is dynamically generated from Javascript. Is there a way to execute scripts on the page so I can get the full web page content?

NavigateToPage hangs

When I call NavigateToPage on https://portal.ryder.com, the call hangs. I have tried it both with AutoRedirect on and off using code that works on another site.

In Fiddler, I get from "/" to "/redirect" then to a second "/redirect" which then hangs.

Btw, thanks for the great code.

Issue parsing empty select

If there is a select input without options (one populated later by script), the form parser throws an exception, in PageWebForm.cs, ParseFormFields. Here I've put some null checks on value:

var selects = from @select in node.CssSelect("select")
              let name = @select.GetAttributeValue("name")
              let option =
                  @select.CssSelect("option").FirstOrDefault(o => o.Attributes["selected"] != null) ??
                  @select.CssSelect("option").FirstOrDefault()
              let value = (option == null) ? null : option.GetAttributeValue("value")
              select new FormField
              {
                  Name = name,
                  Value = string.IsNullOrEmpty(value) ? (option == null ? "" : option.InnerText) : value
              };

Extend to allow for node lookups for multiple selectors

At present, if I have the following HTML:

<html>
    <body>
        <p id="description">This is a description</p>
        <p id="secondary-description">This is a secondary description</p>
        <p id="not-a-description">This is a secondary description</p>        
    </body>
</html>

If you want to get description and secondary-description, you need two calls:

    var first = html.CssSelect("#description");
    var second = html.CssSelect("#secondary-description");

Ideally you could use a single call like:

    var nodes = html.CssSelect(new string[] { "#description", "#secondary-description" });
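Until such an overload exists, the two result sets can be combined with LINQ, e.g. html.CssSelect("#description").Concat(html.CssSelect("#secondary-description")). The idea can be sketched with stdlib XML parsing alone, since the sample markup above happens to be well-formed (real pages need HtmlAgilityPack; MultiSelect is a hypothetical helper):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Illustrative only: XDocument stands in for HtmlAgilityPack, because the
// sample markup is well-formed XML. The technique is the same: one pass over
// the nodes, keeping those whose id is in the requested set.
static class MultiSelect
{
    public static List<string> SelectByIds(string markup, params string[] ids)
    {
        var doc = XDocument.Parse(markup);
        var wanted = new HashSet<string>(ids);
        return doc.Descendants()
                  .Where(e => wanted.Contains((string)e.Attribute("id")))
                  .Select(e => e.Value)
                  .ToList();
    }
}
```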

Downloading resource files

How do you use ScrapingBrowser to download web resource files? I would love this feature but don't see how to save to disk.

For example, I am able to achieve this by adding to ScrapingBrowser:

    public WebResource DownloadWebResourceFile(Uri url, string path, FileMode mode)
    {
        var response = ExecuteRequest(url, HttpVerb.Get, new NameValueCollection());
        var stream = new FileStream(path, mode);
        var responseStream = response.GetResponseStream();

        if (responseStream != null)
        {
            responseStream.CopyTo(stream);
            responseStream.Close();
        }

        return new WebResource(stream, response.Headers["Last-Modified"], url, !IsCached(response.Headers["Cache-Control"]), response.ContentType);
    }

Thank you.
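One hazard in code like the snippet above is closing the response stream outside a null check and never disposing the destination stream on failure. The copy step can be hardened with using blocks, sketched here against in-memory streams (CopyResponse is a hypothetical helper, not ScrapySharp API):

```csharp
using System;
using System.IO;

// Sketch of a safer copy step: the source stream is disposed by the using
// block even if CopyTo throws. MemoryStream stands in for the HTTP response
// stream and the destination file.
static class ResourceDownload
{
    public static void CopyResponse(Stream responseStream, Stream destination)
    {
        if (responseStream == null)
            return; // nothing to copy

        using (responseStream)
        {
            responseStream.CopyTo(destination);
        }
    }
}
```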

Working Timeout with Proxy

I am scraping a site that requires the use of hundreds or thousands of proxies to make it through. I have a list of 20,000 proxies, but many are dead. When I encounter a dead proxy, I mark it and retry with a different one, but often the timeout for a dead proxy is around 15 seconds. It would be nice if setting browser.Timeout handled that, especially on async calls.
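Whether browser.Timeout governs async navigation is the open question here; as a client-side workaround, any async call can be raced against Task.Delay to enforce a hard cap. A sketch (TimeoutHelper is not ScrapySharp API):

```csharp
using System;
using System.Threading.Tasks;

// Race a task against Task.Delay so a dead proxy can be abandoned after a
// hard cap instead of waiting out the full socket timeout.
static class TimeoutHelper
{
    public static async Task<T> WithTimeout<T>(Task<T> task, TimeSpan timeout)
    {
        var finished = await Task.WhenAny(task, Task.Delay(timeout));
        if (finished != task)
            throw new TimeoutException("operation exceeded " + timeout);
        return await task; // propagate the result or the original exception
    }
}
```

For example, wrap the navigation call (assumed async here) and, on TimeoutException, mark the proxy dead and rotate to the next one.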

FSharp does not support .net native UWP

Hello, it turns out that to publish a UWP app for the Windows Store I need to compile with .NET Native, but the only package that gives me problems is ScrapySharp: Error GK0025 In assembly 'C:\Users\humbe\source\repos\GraphPriceOne-key\GraphPriceOne\bin\x64\Release\ScrapySharp.Core.dll': FSharp support is not yet implemented but 'FSharp.Core' is using it.

Could that 9% be ported to C#?

Could not install package from nuGet

Here is the error I get when trying to install the package from NuGet or MyGet:

Install-Package : Could not install package 'ScrapySharp 3.0.0'. You are trying to install this package into a project that targets '.NETFramework,Version=v2.0', but the package does not contain any assembly references or content files that are 
compatible with that framework. For more information, contact the package author.
At line:1 char:1
+ Install-Package ScrapySharp -Version 3.0.0 -Source https://www.myget. ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Install-Package], Exception
    + FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PackageManagement.PowerShellCmdlets.InstallPackageCommand

ScrapySharp.Network.Webpage Class

Hi,

I am new to ScrapySharp and have a couple of questions that I hope to get answered promptly:

  1. What is the use of public void SaveSnapshot(string path)? An example will be of great help.
  2. How can I take the snapshot of the whole web page?

--
Regards,
Omer Rasheed

Error while dockerize ScrapySharp

Hello, I have a problem when I try to dockerize an application that uses ScrapySharp: in Docker it throws the following error on the line where the Click function is called on a hyperlink: "Unable to cast object of type 'System.Net.FileWebRequest' to type 'System.Net.HttpWebRequest'".

suggest: support selectors like `:scope > a`

environment

  • netcoreapp2.2
  • dotnet add package ScrapySharp --version 3.0.0

details

invalid selector :scope > a

example

<div id="node">
some text <a>link</a> 
<span>span</span> some text <br/>
<p><a>link</a> </p>
</div>

I want to select something like `& > a` or `#node > a`, but using (HtmlNode node).CssSelect("selector"):

HtmlNode node = ....
node.CssSelect(":scope > a")

The same goes for `.tit > a:nth-child(2)`; I can use LINQ instead, like

var link = tit.ChildNodes.Where((e, idx) => e.Name == "a" && idx == 1).FirstOrDefault();

but when a selector finds no element, it is easy to throw an exception that includes the selector, whereas with LINQ I need to add extra messages myself.

Thanks.

ScrapySharp is a simple and powerful tool; this is a feature recommendation. Thank you to all the participants for their great work.
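The direct-child lookup plus the wished-for selector-bearing exception can be sketched with LINQ. XDocument stands in for HtmlAgilityPack here (with HtmlNode, filtering ChildNodes by name plays the same role); ScopeSelect is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Sketch of a ":scope > a" lookup (direct children only) that throws a
// message containing the selector when nothing matches.
static class ScopeSelect
{
    public static List<XElement> DirectChildren(XElement node, string name)
    {
        var hits = node.Elements(name).ToList(); // Elements() = direct children only
        if (hits.Count == 0)
            throw new InvalidOperationException(
                "no match for selector ':scope > " + name + "'");
        return hits;
    }
}
```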

Body data is not being sent when using PUT verb (Scrapingbrowser)

I'm trying to send some data in the request body using the PUT verb, but it is not working; nothing is sent.

Does anyone know how to solve that?

This is my code:

data = "{\"assignee\":\"" + USER_NAME + "\"}";
browser.NavigateToPage(new Uri(URL), HttpVerb.Put, data);

When inspecting the request in Fiddler, I can see nothing is being passed in the body.
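Until the cause is found, one workaround is to bypass ScrapingBrowser and build the PUT with stdlib HttpClient types. The sketch below only constructs the request (no network call); the user name is passed as a parameter in place of the USER_NAME constant from the report:

```csharp
using System;
using System.Net.Http;
using System.Text;

// Hypothetical workaround: build a PUT request whose JSON body is guaranteed
// to be attached, ready to be sent with HttpClient.SendAsync.
static class PutRequest
{
    public static HttpRequestMessage Build(string url, string userName)
    {
        var json = "{\"assignee\":\"" + userName + "\"}";
        return new HttpRequestMessage(HttpMethod.Put, url)
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
    }
}
```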


Cookie support and logging

I'm trying to navigate through a website that requires login. I have the idea that maybe cookies aren't stored/submitted on subsequent GET/POST operations. Unfortunately there is no easy way to see what is being submitted.

It would be helpful if this information were logged in some way, or if it were possible to read these structures.

Some css selector not supported

Unless I'm failing somewhere else, some selectors throw an exception: Invalid css selector syntax. The two that aren't working so far are "E + F" and "E:not(selector)".

Is there a list of supported selectors?
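Until "E + F" is supported, adjacent-sibling matching can be emulated with LINQ. A sketch over stdlib XDocument (with HtmlAgilityPack, sibling navigation plays the same role; SiblingSelect is a hypothetical helper):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// CSS "E + F": all F elements whose immediately preceding element sibling
// is an E. ElementsAfterSelf() skips text nodes, matching CSS semantics.
static class SiblingSelect
{
    public static List<XElement> AdjacentSiblings(XElement root, string e, string f)
    {
        return root.Descendants(e)
                   .Select(x => x.ElementsAfterSelf().FirstOrDefault())
                   .Where(n => n != null && n.Name.LocalName == f)
                   .ToList();
    }
}
```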

Update HTMLAgilityPack Dependency version

Currently this project only supports HTMLAgilityPack up to 1.7.4, but the latest version is many iterations past that, at 1.11.7.

Is this project still maintained, or is there a reason the latest dependency is incompatible with ScrapySharp?

Auto Detect Encoding can not work / Inaccurate results

Description

Auto-detect encoding does not work. The target web page's encoding is gb2312 (code page 936), and the detection results are inaccurate.

Test code with inaccurate results (AutoDetectCharsetEncoding):

var scrapingBrowser_demo001 = new ScrapySharp.Network.ScrapingBrowser()
{
    AutoDetectCharsetEncoding = true
};
ScrapySharp.Network.WebPage homePage_demo001 =
    scrapingBrowser_demo001.NavigateToPage(
        new Uri("https://news.163.com/19/0717/11/EK9KFG4A0001885B.html"));

Thanks very much.
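When AutoDetectCharsetEncoding guesses wrong, a manual fallback is to sniff the charset from the page's meta tags and decode the raw bytes yourself. A sketch of the sniffing step (CharsetSniffer is a hypothetical helper; note that legacy code pages like gb2312 also require CodePagesEncodingProvider on .NET Core):

```csharp
using System;
using System.Text.RegularExpressions;

// Pull a charset name out of the page head, covering both
// <meta charset="..."> and the older http-equiv content="...; charset=..." form.
static class CharsetSniffer
{
    public static string FromMeta(string htmlHead)
    {
        var m = Regex.Match(htmlHead, "charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)",
                            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.ToLowerInvariant() : null;
    }
}
```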

now all fixed

Can I get this in standard .NET, or can I install the NuGet package? I have some problems with referencing it in Xamarin.

Issue with cookie path

I am trying to scrape a server that returns multiple Set-Cookie headers, and the alternative parser does not work correctly.
The default one works but throws an invalid-cookie-path error because one cookie carries a path.

The Uri construction here omits the path and supplies the domain only, for example https://example.com:443/; but if the cookie has a path, for example /forum, cookieContainer.SetCookies throws an exception about an invalid path. Just passing the original url here fixed it for me. I'm not sure why that url construction was needed.

ScrapingBrowser.cs
private async Task GetWebResponseAsync(Uri url, HttpWebRequest request)

var cookieUrl = new Uri(string.Format("{0}://{1}:{2}/",
    response.ResponseUri.Scheme, response.ResponseUri.Host, response.ResponseUri.Port));

if (UseDefaultCookiesParser)
    cookieContainer.SetCookies(url, cookiesExpression);
else
    SetCookies(url, cookiesExpression);

...

Regards,
Bojo
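The fix described above (passing the original request URL, path included, to SetCookies) can be demonstrated with the stdlib CookieContainer alone:

```csharp
using System;
using System.Net;

// With the original request URL (path included), a path-scoped cookie like
// Path=/forum is accepted and returned for matching requests; a bare
// scheme://host:port/ reconstruction is what triggered the invalid-path error.
static class CookieDemo
{
    public static int StoreAndCount()
    {
        var container = new CookieContainer();
        var requestUrl = new Uri("https://example.com/forum/index.php");

        container.SetCookies(requestUrl, "session=abc123; Path=/forum");

        // The cookie comes back for URLs under /forum.
        return container.GetCookies(new Uri("https://example.com/forum/topic/1")).Count;
    }
}
```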
