sjdirect / abotx

Cross-platform C# web crawler framework with a headless browser and parallel crawling. Please star this project!

Home Page: https://abotx.org

C# 100.00%
abotx abotx-website cross-platform csharp csharp-library framework headless headless-browser javascript-renderer netcore netcore3 netstandard netstandard-libraries netstandard20 spider spiders web-crawler

abotx's People

Contributors

sjdirect

abotx's Issues

Robots.txt is not reloaded when uri scheme is changed (http/https)

Hello,

We found Abot a few days ago and have been trying its free version to see if it can meet our needs.

Everything worked fine until we noticed that it crawls URLs that are disallowed in robots.txt.

After some debugging, we found that it binds robots.txt to the URI scheme of the initial site to crawl. For example, for the site https://mysite.com, disallowed URLs are honored only for https; if there is a link to http://mysite.com/somepage, Abot will ignore robots.txt and crawl it.

Assuming we have the following robots.txt:
User-agent: *
Disallow: /somepage

Could you help us deal with this issue?
Thank you
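
A possible workaround until this is fixed: enforce the disallow rules yourself via Abot's ShouldCrawlPage shortcut delegate, comparing only the URL path so both schemes are treated alike. This is a minimal sketch; the disallowedPaths list stands in for rules you would parse out of robots.txt yourself, and the exact delegate signature should be verified against your Abot version.

using System;
using System.Collections.Generic;
using System.Linq;
using Abot.Crawler;
using Abot.Poco;

var crawler = new PoliteWebCrawler();
var disallowedPaths = new List<string> { "/somepage" }; //hypothetical: parsed from robots.txt yourself

crawler.ShouldCrawlPage((pageToCrawl, crawlContext) =>
{
    //Compare only the path so http:// and https:// variants are treated alike
    var path = pageToCrawl.Uri.AbsolutePath;
    var disallowed = disallowedPaths.Any(p =>
        path.StartsWith(p, StringComparison.OrdinalIgnoreCase));

    return disallowed
        ? new CrawlDecision { Allow = false, Reason = "Disallowed by robots.txt (scheme-agnostic check)" }
        : new CrawlDecision { Allow = true };
});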

Text issue on abotx.org website

In the example it uses:
crawler.CrawlConfigurationX.IsJavascriptRenderingEnabled = true;
crawler.CrawlConfigurationX.MaxConcurrentSiteCrawls = 1; //Only crawl a single site at a time
crawler.CrawlConfigurationX.MaxConcurrentThreads = 8;

Instead of using the CrawlConfigurationX class directly.

Custom implementations for IThreadManager don't work

Hi,

When using a custom implementation of IThreadManager, injected through AbotX.Poco.ImplementationContainer.ThreadManager, I get the following exception:

2018-06-12 06:04:00,229 [12] FATAL - [AbotLogger] (0) System.InvalidOperationException: Cannot call DoWork() after AbortAll() or Dispose() have been called.
   at Abot.Util.ThreadManager.DoWork(Action action)
   at Abot.Crawler.WebCrawler.CrawlSite()
   at Abot.Crawler.WebCrawler.Crawl(Uri uri, CancellationTokenSource cancellationTokenSource)

The exception is thrown when the crawler crawls the second (and following) seeds on the same thread. While stepping through the code, I saw that Abot.Crawler.WebCrawler gets the thread manager injected through its constructor and disposes it while executing Abot.Crawler.WebCrawler.Crawl(...).

When not using a custom thread manager, a default one is constructed in AbotX.Parallel.WebCrawlerFactory.CreateInstance(SiteToCrawl), through the AbotX.Core.ImplementationOverride constructor, for every crawl. When using the implementation override, the instance passed to AbotX.Poco.ImplementationContainer.ThreadManager is reused for every crawl. So this instance gets disposed after the first crawl and throws the exception above for all following crawls.
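
A workaround sketch: instead of assigning one shared IThreadManager instance to ImplementationContainer.ThreadManager (which gets disposed after the first crawl), construct a fresh instance for every crawl inside a custom IWebCrawlerFactory. The factory shape below is inferred from the stack trace above (CreateInstance(SiteToCrawl)); MyCustomThreadManager stands in for your own implementation, and the ImplementationOverride/ImplementationContainer constructor shapes should be verified against your AbotX version.

using Abot.Crawler;
using AbotX.Core;
using AbotX.Parallel;
using AbotX.Poco;

public class FreshThreadManagerCrawlerFactory : IWebCrawlerFactory
{
    private readonly CrawlConfigurationX _config;

    public FreshThreadManagerCrawlerFactory(CrawlConfigurationX config)
    {
        _config = config;
    }

    public IWebCrawler CreateInstance(SiteToCrawl siteToCrawl)
    {
        //A new thread manager per crawl, so disposal at the end of
        //WebCrawler.Crawl(...) cannot poison the next crawl
        var impls = new ImplementationContainer
        {
            ThreadManager = new MyCustomThreadManager(_config.MaxConcurrentThreads)
        };
        return new CrawlerX(_config, new ImplementationOverride(_config, impls));
    }
}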

The DateTime represented by the string is out of range

From customer...

We get the following exception: "The DateTime represented by the string is out of range" (thrown when the expiration date is parsed as a DateTime).
I saw others had the same problem in the following thread: dnauck/Portable.Licensing#26.
I tried setting the date to 200 years from now just as a test, but then I get an exception that the expiration date does not match the signature.

Crawl through a proxy server

Hi,

I'm trying to figure out how to configure the crawler to use a proxy server/port for connecting to the destination website, but I can't seem to find any information about that.

Is there any way of doing that?
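
For reference, the standard .NET way to route HTTP traffic through a proxy is an HttpClientHandler, as in the sketch below. The proxy address is hypothetical, and how you hand the resulting client to Abot/AbotX (for example via a custom IPageRequester) depends on your version, so treat the wiring as an assumption to verify.

using System.Net;
using System.Net.Http;

var handler = new HttpClientHandler
{
    Proxy = new WebProxy("http://myproxy.example.com:8080"), //hypothetical proxy address
    UseProxy = true
};
var httpClient = new HttpClient(handler); //would need to be plugged into the crawler's page requester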

Javascript rendering - detecting window.location changes

What would be your recommended way of dealing with window.location changes on a page? I'm crawling sites that have a method that looks something like the following, probably to break crawlers:

function iframeOnLoad() {
  var reqUrl = 'https://domain.com/page_i_want';
  setTimeout(function () { window.location = reqUrl; }, 3000);
}

<iframe onload="iframeOnLoad()" />

Assuming PhantomJS is rendering this, is it possible to detect URL changes when window.location is set via JS? I could maybe write some custom add-ons, but I'm not sure if this is already handled somehow.

When I try to instantiate the ParallelCrawlerEngine, I get "Your current AbotX license does not include Auto Tuning."

User reported...

-------------------------------User Reported------------------------------

When I try to instantiate the ParallelCrawlerEngine, I get the below exception. However, I checked the value of AutoTuning.IsEnabled, and it's set to false. So is AutoThrottling.IsEnabled.

The version of AbotX installed is: 1.2.28, installed via NuGet. This also occurred with the previous version, which I believe was 1.1.x

Also, beginning with version 1.2.28, I now get a warning about my call of the ParallelCrawlerEngine's constructor being obsolete. Here is my current constructor call:

var crawlEngine = new ParallelCrawlerEngine(cx, abotFactory, null, siteToCrawlProvider);

Intellisense says that I should be using a constructor that refers to a 'ParallelImplementationOverride' - what's that about?
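
For what it's worth, the newer constructor shape (the same pattern shown in the AbotX README and in a later issue below) wraps the implementations in a ParallelImplementationOverride. A sketch of the equivalent call, reusing the variables from the old one:

var crawlEngine = new ParallelCrawlerEngine(
    cx,
    new ParallelImplementationOverride(cx,
        new ParallelImplementationContainer
        {
            SiteToCrawlProvider = siteToCrawlProvider,
            WebCrawlerFactory = abotFactory
        }));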

-------------------------------Stack Trace------------------------------

System.UnauthorizedAccessException was unhandled by user code
HResult=-2147024891
Message=Your current AbotX license does not include Auto Tuning. Please change AutoTuning.IsEnabled to false or upgrade your license.
Source=AbotX
StackTrace:
at AbotX.Core.HostStressAnalyzerCpu.CheckLicense()
at AbotX.Core.HostStressAnalyzerCpu..ctor(CrawlConfigurationX config, ICpuSampler cpuSampler)
at AbotX.Parallel.ParallelImplementationOverride..ctor(CrawlConfigurationX config, ParallelImplementationContainer impls)
at AbotX.Parallel.ParallelCrawlerEngine..ctor(CrawlConfigurationX config, IWebCrawlerFactory webCrawlerFactory, IRateLimiter rateLimiter, ISiteToCrawlProvider siteToCrawlProvider)
at lsSiteCrawler.Crawler.MultipleSitesCrawler.Crawl(Boolean bLoadCSS) in C:\Users\rjones\Documents\Visual Studio 2015\Projects\lsSiteCrawler\src\lsSiteCrawler\Crawler\SiteCrawler.cs:line 797
at lsSiteCrawler.Controllers.HarvestedLinksController.Post() in C:\Users\rjones\Documents\Visual Studio 2015\Projects\lsSiteCrawler\src\lsSiteCrawler\Controllers\HarvestedLinksController.cs:line 218
at lambda_method(Closure , Object , Object[] )
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.d__28.MoveNext()
InnerException:

Implementation override ignoring shortcut delegates

Hi There,

I am using AbotX, specifically the ImplementationOverride. While the Scheduler does get replaced, the shortcut delegates (ShouldScheduleLink, ShouldCrawlPage, etc.) seem to be ignored.
Is this a known issue?

var implementationOverride = new ImplementationOverride(config) {
    Scheduler = new MyScheduler(),
    ShouldScheduleLink = crawler_ShouldScheduleLink,
    ShouldCrawlPage = crawler_ShouldCrawlPage,
    ShouldDownloadPageContent = crawler_ShouldDownloadPageContent,
    ShouldCrawlPageLinks = crawler_ShouldCrawlPageLinks,
}; 
var crawler = new CrawlerX(config, implementationOverride);
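
A possible workaround sketch, assuming CrawlerX inherits Abot's shortcut registration methods (it extends the Abot crawler, but verify this against your version): register the delegates on the crawler instance itself rather than through the ImplementationOverride initializer.

var crawler = new CrawlerX(config, new ImplementationOverride(config) { Scheduler = new MyScheduler() });
crawler.ShouldCrawlPage(crawler_ShouldCrawlPage);
crawler.ShouldDownloadPageContent(crawler_ShouldDownloadPageContent);
crawler.ShouldCrawlPageLinks(crawler_ShouldCrawlPageLinks);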

AbotX produces huge amount of warnings on Linux

We are using AbotX in an application running on a containerized Ubuntu.
On almost every page crawl, a warning is logged which reads: "Cpu sampling implementation is not supported on this platform. Current implementation uses PerformanceCounter which is only valid on Windows."

Since we log warning-level messages too, this makes our logs useless and causes problems for our logging server.

I can see that System.Diagnostics.PerformanceCounter is referenced by AbotX. Since the counter is a Windows-only API, and considering the warnings, I get the feeling that something is not working as expected on Linux, which might have other consequences too.


Please advise on what can be done about this.

Parallel engine not working

I am trying to test the parallel engine, but it is not working; it returns after the first page crawl. I am testing with a license and would consider upgrading if it works.

Add Elapsed property on the AllCrawlsCompleted event

We would like to be able to retrieve a TimeSpan Elapsed property in the AllCrawlsCompleted event of the ParallelCrawlerEngine, just as on other events like SiteCrawlCompleted.

For now, we're using a workaround: declaring a Stopwatch variable at class scope, starting it before crawler.StartAsync() is called, and stopping it inside the AllCrawlsCompleted event to get the elapsed TimeSpan.
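
A sketch of that workaround (names are illustrative):

using System;
using System.Diagnostics;

var crawlTimer = new Stopwatch();

crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
{
    crawlTimer.Stop();
    Console.WriteLine($"All crawls completed in {crawlTimer.Elapsed}");
};

crawlTimer.Start();
await crawlEngine.StartAsync();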

Separate CrawlerX instances crawl each other's sites

I have a class with public static AlwaysOnSiteToCrawlProvider _siteToCrawlProviderX = new AlwaysOnSiteToCrawlProvider(); and a function that sets up a globalCrawlEngine from a singleton.

If I step through the code I can see globalCrawlEngine.parallelCrawler.CrawlerInstanceCreated getting called and all of the data is correct; however, by the time PageCrawlCompletedAsync => {} gets called, the SiteBag (CrawlBag) data doesn't match the URL that has been crawled.

This tends to happen only after a fresh compile of the project and when two requests are made in relatively quick succession.

Any ideas?

Javascript rendering does not work on Azure Web App or API managed PaaS

Email contents....

Steven,

Just wanted you to know that even though I’ve managed to get Abot / AbotX running on an Azure WebAPI instance, there is a huge problem with javascript rendering. My WebAPI code was exhibiting horrible performance issues that only seemed to happen when I published the code to an Azure WebAPI instance. When running on my development machine (a Windows 10 VM) it worked just fine. So I opened a support case with Azure support, and they got back to me with why my code was running so slow. It seems that phantomjs.exe is wanting to execute some code that Microsoft disallows in the WebAPI ‘sandboxed’ instance. They identified the code as: NTUserSystemParametersInfo(). They said that the phantomjs executable would attempt this call hundreds of times, each time a failure, before giving up.

So, my fix looks like it’s going to require that I change my code from a WebAPI project to something else that can run on a ‘pure’ Windows VM world (Microsoft support said that there is no problem running phantomjs.exe on a real Windows VM). But before going down that path, I thought you’d like to know this information, because I believe that at one time you had told me that I am the first of your customers to try to run Abot / AbotX on an Azure WebAPI instance. Also, I thought you might have an idea of a workaround (short of disabling javascript rendering!) that would save me the effort of re-writing my code.

Rob Jones

How would I crawl a single site with multiple pages in parallel?

Hi,

Thanks for the product!

Apologies for the many questions.

How would I crawl a single site with multiple pages in parallel?
Do I need AbotX, or would Abot do?
Do I need to loop through the list of sites if I can only do 3 at a time in the free version?
Is it ideal to have this in a job that keeps track of runs?
Also, it doesn't say in which part of the code I get the crawled data... is it in crawlEngine.SiteCrawlCompleted, after the lock(crawlCounts){...} statement?

Example

        private static async Task DemoParallelCrawlerEngine()
        {
            var siteToCrawlProvider = new SiteToCrawlProvider();
            siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
            {
                new SiteToCrawl{ Uri = new Uri("YOURSITE1") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE2") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE3") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE4") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE5") }
            });

            var config = GetSafeConfig();
            config.MaxConcurrentSiteCrawls = 3;
                
            var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );                
            
            var crawlCounts = new Dictionary<Guid, int>();
            var siteStartingEvents = 0;
            var allSitesCompletedEvents = 0;
            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
            {
                var crawlId = Guid.NewGuid();
                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;
            };
            crawlEngine.SiteCrawlStarting += (sender, args) =>
            {
                Interlocked.Increment(ref siteStartingEvents);
            };
            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
            {
                lock (crawlCounts)
                {
                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);
                }
            };
            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
            {
                Interlocked.Increment(ref allSitesCompletedEvents);
            };

            await crawlEngine.StartAsync();
        }

Error when initializing CrawlerX/ParallelCrawlerEngine (System.FormatException)

I configured everything as described at http://abotx.org/Learn/Configuration, but I still get an unhandled exception of type System.FormatException in mscorlib.dll.

Additional information: The DateTime represented by the string is out of range.

It occurs when I initialize with the following command:
var crawler = new CrawlerX(); //Error when initializing the CrawlerX object
Here is the configuration in my app.config file (the XML contents were stripped when posting; only the skeleton survived):

<configuration>
  <configSections>
    ...
  </configSections>
  <abot>
    ...
    <extensionValues>
      ...
    </extensionValues>
  </abot>
</configuration>

Could you send me an example using AbotX?

Thanks

AbotX not respecting CrawlConfigurationX MaxPagesToCrawlPerDomain and MaxCrawlDepth

I'm testing this and the crawler is currently on page ~55,000 and several layers deep for one of the three domains in the test. The code I use to load the configuration is below. I load from the app config XML and then override some of the settings in the method, to customize the crawl based on user input for the specific test crawls that I'm running. The two values in question are hard-coded to 1000 and 1 respectively for this test. Am I doing something wrong?

var config = AbotXConfigurationSectionHandler.LoadFromXml().Convert();
config.CrawlTimeoutSeconds = timeoutMilliseconds / 1000;
config.HttpRequestTimeoutInSeconds = timeoutMilliseconds / 1000;
config.JavascriptRenderingWaitTimeInMilliseconds = timeoutMilliseconds;
config.MaxCrawlDepth = 1; //set for testing only
config.JavascriptRenderingWaitTimeInMilliseconds = javascriptTimeout;
config.MaxPagesToCrawlPerDomain = 1000; //set for testing only
ParallelImplementationOverride impls = new ParallelImplementationOverride(config);
impls.SiteToCrawlProvider.AddSitesToCrawl(sites);
ParallelCrawlerEngine crawlEngine = new ParallelCrawlerEngine(config, impls);

Incompatibility with net5.0

Consider this piece of code from the AbotX README page:

var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                ); 

It works fine on netcoreapp3.1; however, on net5.0 the line new ParallelImplementationOverride(config, ... raises the exception System.Security.Cryptography.CryptographicException: ASN1 corrupted data (originating from System.Security.Cryptography.Algorithms) and execution is fatally interrupted.

This makes it impossible to use the parallel crawler on .NET 5.

crawledPage.HttpRequestException on https://aanhangwagenspattyn.be/ while OK in browser or Postman

While this site (an example; there are multiple affected sites) opens without problems in Firefox, Abot has a problem with it.
The error message refers to an "invalid or unrecognized response".
After some more digging: when using Postman with the exact same request, the site returns and responds normally.
It looks as if HttpClient.SendAsync in the PageRequester is causing the problem.

Thanks in advance,

Ghislain

Output of Abot2Demo

Did not crawl the links on page https://aanhangwagenspattyn.be/ due to Page has no content
ERR: Crawl of page failed crawledPage.HttpRequestException.InnerException = System.IO.IOException: The server returned an invalid or unrecognized response.
at System.Net.Http.HttpConnection.FillAsync()
at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed)
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
Page had no content https://aanhangwagenspattyn.be/

Current AbotX license does not include rendering of javascript.

Hello!
I downloaded the .lic from the repository and saved it in the root of the project.
When I try to reproduce the example from the README, I get an exception:
System.UnauthorizedAccessException: Your current AbotX license does not include rendering of javascript. Please change IsJavascriptRenderingEnabled to false or upgrade your license.

Problems with Javascript rendering (phantomjs.exe not found)

Recent changes to NuGet have made some target framework installations ignore the install.ps1 of the PhantomJS 2.1.1 NuGet package that AbotX relies on. The side effect is that the phantomjs.exe file does not get copied to the output directory, which then fails during javascript rendering.

The workaround is to manually copy the executable from the NuGet installation directory "packages\PhantomJS.2.1.1\tools\phantomjs\phantomjs.exe" to your project root, then mark it as "Content" with "Copy if newer".

See this Stack Overflow answer for more details on how to set up your solution to copy it to the output directory...

Need to implement some type of release notes

Hey Steven.

I see all of these updates for AbotX in my NuGet package manager, but I'm having a difficult time finding the revision notes.
Am I missing something obvious, or do you not publish revision notes?

Javascript rendering even if ShouldRenderJavascript returns false

CrawlDecisionMakerX.ShouldRenderJavascript() is not virtual, so there is no way to override it by inheriting. However, even when we replace it and return false, the following log shows that javascript rendering is still attempted.

[2017-07-09 16:46:17,733] [4] [DEBUG] - Page [https://XXX/sitemap.xml] did not have javascript rendered, [not an html page] - [AbotLogger]
[2017-07-09 16:46:17,762] [4] [DEBUG] - Rendering javascript for page [https://XXXX/sitemap.xml] - [AbotLogger]

User agent

The user agent works correctly in Abot, but in AbotX it is not applied correctly.

AbotX.lic

The beta license file AbotX.lic does not seem to allow me to use javascript rendering.

[2016-02-17 21:01:09,557] [6] [FATAL] - System.UnauthorizedAccessException: Your current AbotX license does not include rendering of javascript. Please change IsJavascriptRenderingEnabled to false or upgrade your license.
at AbotX.Core.PhantomJsRenderer.IsLicensed()

But I have placed the lic file in the bin directory.

What happened to some of the properties that were in v1?

I had a program that used v1 of AbotX. When I upgraded to v2, some of the properties give errors because they no longer exist. What happened to those properties?

For example: IsExternalPageLinksCrawlingEnabled, DownloadableContentTypes, MaxCrawlDepth, IsExternalPageCrawlingEnabled...

Crash when stopping

Hello, I get an error when trying to stop:
System.OperationCanceledException: 'The operation was canceled.'
The error occurs at:
[DoesNotReturn]
private void ThrowOperationCanceledException() =>
throw new OperationCanceledException(SR.OperationCanceled, this);

Can you show me how to fix this? Thanks.


Access to the registry key 'Global' is denied.

Email received...

I'm running your latest code (1.2.44) and it runs just fine when not under IIS. When I place my code under IIS (ASP.NET Core 1.0), I get the following exception:

Access to the registry key 'Global' is denied.

This occurs when the below code is executed:

ParallelImplementationContainer implContainer = new ParallelImplementationContainer();
implContainer.SiteToCrawlProvider = siteToCrawlProvider;
implContainer.WebCrawlerFactory = abotFactory;

So is there any registry access going on in AbotX that I don't know about? Or is this something going on inside of IIS (running on Server 2012 R2)?

Constructing wrong URLs to crawl from anchor tags without scheme

The ParallelCrawlerEngine is getting the wrong URLs to crawl. Upon checking the page at the parent URI, I could not find where the wrong URL comes from. It's probably the <a> anchor tag without the "https://" scheme:

<a href="www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481"> 
    bla bla
</a>

Parent URI:
https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046

Parsed Hyperlink (Wrong URL):
https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481
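
This is standard relative URL resolution: a scheme-less href like "www.thelawyermag.com/..." is a relative path per RFC 3986, so it gets resolved against the parent page's path. A minimal repro of the behavior with .NET's Uri class:

using System;

var parent = new Uri("https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046");
var href = "www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481";

//Relative resolution replaces the last path segment of the parent URI
var resolved = new Uri(parent, href);
Console.WriteLine(resolved); //prints the wrong URL reported above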
