
linkcrawler's Introduction

LinkCrawler

Simple C# console application that will crawl a given webpage for broken image tags and hyperlinks. The result is written to one or more outputs. Right now we have these outputs: console, CSV, Slack.

Why?

Because it can be useful to know when a webpage you are responsible for displays broken links to its users. I have this running continuously, but you don't have to. For instance, after upgrading your CMS, changing the database schema, migrating content etc., it can be useful to know whether this did or did not introduce broken links. Just run this tool once and you will know exactly how many links are broken, where they link to, and where they are located.

Build

Clone repo 👉 open solution in Visual Studio 👉 build 👊

AppVeyor is used for CI, so when code is pushed to this repo the solution is built and all tests are run.

Branch Build status
develop (AppVeyor build badge)
master (AppVeyor build badge)

AppSettings

Key Usage
BaseUrl Base url for site to crawl
SuccessHttpStatusCodes HTTP status codes that are considered "successful". Example: "1xx,2xx,302,303"
CheckImages If true, <img src=".."> tags will also be checked
ValidUrlRegex Regex to match valid URLs
Slack.WebHook.Url URL of the Slack webhook. If empty, no messages will be sent to Slack
Slack.WebHook.Bot.Name Custom name for slack bot
Slack.WebHook.Bot.IconEmoji Custom Emoji for slack bot
OnlyReportBrokenLinksToOutput If true, only broken links will be reported to output.
Slack.WebHook.Bot.MessageFormat String format message that will be sent to slack
Csv.FilePath File path for the CSV file
Csv.Overwrite Whether to overwrite or append (if file exists)
Csv.Delimiter Delimiter between columns in the CSV file (like ',' or ';')
PrintSummary If true, a summary will be printed when all links have been checked.
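
For reference, these keys live in the application's app.config; the example below uses purely illustrative values:

<appSettings>
  <!-- illustrative values only -->
  <add key="BaseUrl" value="http://www.the-website-to-crawl.com" />
  <add key="SuccessHttpStatusCodes" value="1xx,2xx,302,303" />
  <add key="CheckImages" value="true" />
  <add key="OnlyReportBrokenLinksToOutput" value="true" />
  <add key="Csv.FilePath" value="C:\temp\linkcrawler.csv" />
  <add key="Csv.Overwrite" value="true" />
  <add key="Csv.Delimiter" value=";" />
  <add key="PrintSummary" value="true" />
</appSettings>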

There is also an <outputProviders> section that controls which outputs should be used.

Output to file

Running LinkCrawler.exe >> crawl.log will save the output to a file.

Output to slack

If configured correctly, the defined Slack webhook will be notified about broken links.

How I use it

I have it running as a WebJob in Azure, scheduled every 4 days. It will notify the Slack channel where the editors of the website dwell.

Creating a WebJob is simple. Just put your compiled project files (/bin/) inside a .zip and upload it.

Schedule it.

The output of a WebJob is available because Azure saves it to log files.

Read more about Azure Webjobs: https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/

Read more about Slack incoming webhooks: https://api.slack.com/incoming-webhooks

linkcrawler's People

Contributors

davetolan, emeryweistdeusto, hmol, joose1992, justingunther, melkor54248, mgroves, niklashansen, paulhtrott, tdwright


linkcrawler's Issues

If URL goes to a file, don't try to fetch markup

When crawling http://www.the-website-to-crawl.com and the application gets to a URL for a file, e.g. http://www.the-website-to-crawl.com/reports/report.pdf, it will throw an exception. This is because it then tries to fetch HTML markup from the PDF file. So, when reaching a URL for a file, don't fetch markup, just continue.
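
A minimal sketch of one way to guard against this, assuming the check is done on the URL's file extension before requesting markup (the class name and extension list are illustrative, not part of the project):

// Sketch: skip markup parsing for URLs that point to binary files.
// The extension list and class name are illustrative assumptions.
using System;
using System.Linq;

public static class UrlFileCheck
{
    private static readonly string[] NonHtmlExtensions =
        { ".pdf", ".jpg", ".png", ".gif", ".zip", ".doc", ".docx" };

    public static bool PointsToFile(string url)
    {
        // Uri.TryCreate avoids exceptions on malformed URLs.
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;

        var extension = System.IO.Path.GetExtension(uri.AbsolutePath);
        return NonHtmlExtensions.Contains(extension, StringComparer.OrdinalIgnoreCase);
    }
}

// Usage (sketch): if PointsToFile(url) is true, record the response status
// but skip the markup-parsing step and just continue crawling.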

Proxy support

RestSharp provides proxy support, so can this be surfaced in LinkCrawler too, ideally as a setting through app.config?

I'm happy to make the change and roll it up into the work I'll be doing with #12 (with separate pull requests of course)
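
A rough sketch of how this might be surfaced, assuming a hypothetical Proxy.Url app setting (the key name is an illustration, not an existing LinkCrawler setting) and RestSharp's Proxy property on the client:

using System.Configuration;
using System.Net;
using RestSharp;

public static class ProxiedClientFactory
{
    // Sketch: read an optional proxy address from app.config and apply it
    // to the RestSharp client used for crawling.
    public static RestClient Create(string baseUrl)
    {
        var client = new RestClient(baseUrl);
        var proxyUrl = ConfigurationManager.AppSettings["Proxy.Url"]; // hypothetical key
        if (!string.IsNullOrEmpty(proxyUrl))
        {
            client.Proxy = new WebProxy(proxyUrl);
        }
        return client;
    }
}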

Question - Twitter support

I've just stumbled across this and love it... but I'd like to add support for tweeting out broken links automatically. Similar to the Slack option, just another platform for it, I guess.

I've got experience working with Twitter's API and the associated .NET libs, so I can't see it being particularly tricky; I just wanted to see whether it would be a welcome addition from your point of view before I go off and fork etc.

If the addition would be welcome let me know and I'll provide an outline of how I'd plan on doing it. 😃

Follow redirects feature

If the server responds with a redirect (301 or 302), the crawler should follow it (like curl can, for example); otherwise it misses a bunch of crawlable content.
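
A minimal sketch, assuming LinkCrawler's requests go through RestSharp's RestClient and that its FollowRedirects option is available; whether redirects should be followed unconditionally or be configurable is a separate design question:

using RestSharp;

public static class RedirectFollowingClient
{
    public static RestClient Create(string baseUrl)
    {
        // Sketch: let RestSharp follow 301/302 responses automatically.
        return new RestClient(baseUrl) { FollowRedirects = true };
    }
}

An alternative that keeps the 301 visible in the report would be to leave FollowRedirects off and instead queue the Location header of 3xx responses for crawling.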

All IOutput implementations are always used

Unless I'm misunderstanding something, it seems like all IOutput implementations are always used for each 'error' response.

One thing that might be nice: currently the program will try to output to ALL implementations of IOutput. It might be better to allow that to be configurable (e.g. in app config).

Something like:
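
(The element names below are only an illustrative sketch, not LinkCrawler's actual configuration schema.)

<outputProviders>
  <!-- illustrative sketch: toggle each IOutput implementation individually -->
  <add name="ConsoleOutput" enabled="true" />
  <add name="CsvOutput" enabled="true" />
  <add name="SlackOutput" enabled="false" />
</outputProviders>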

So, if you keep adding output options (as in issue #6), you don't have to keep writing to all of them. Does that sound like a useful feature?

Exception when writing output to CSV

Just came across this when I enabled outputting to CSV on a website.

System.IndexOutOfRangeException was unhandled by user code
  HResult=-2146233080
  Message=Probable I/O race condition detected while copying memory. The I/O package is not thread safe by default. In multithreaded applications, a stream must be accessed in a thread-safe way, such as a thread-safe wrapper returned by TextReader's or TextWriter's Synchronized methods. This also applies to classes like StreamWriter and StreamReader.
  Source=mscorlib
  StackTrace:
       at System.Buffer.InternalBlockCopy(Array src, Int32 srcOffsetBytes, Array dst, Int32 dstOffsetBytes, Int32 byteCount)
       at System.IO.StreamWriter.Write(Char[] buffer, Int32 index, Int32 count)
       at System.IO.TextWriter.WriteLine(String value)
       at System.IO.TextWriter.WriteLine(String format, Object[] arg)
       at LinkCrawler.Utils.Outputs.CsvOutput.Write(IResponseModel responseModel) in C:\Users\Chris\Downloads\LinkCrawler-develop\LinkCrawler\LinkCrawler\Utils\Outputs\CsvOutput.cs:line 43
       at LinkCrawler.Utils.Outputs.CsvOutput.WriteInfo(IResponseModel responseModel) in C:\Users\Chris\Downloads\LinkCrawler-develop\LinkCrawler\LinkCrawler\Utils\Outputs\CsvOutput.cs:line 38
       at LinkCrawler.LinkCrawler.WriteOutput(IResponseModel responseModel) in C:\Users\Chris\Downloads\LinkCrawler-develop\LinkCrawler\LinkCrawler\LinkCrawler.cs:line 92
       at LinkCrawler.LinkCrawler.ProcessResponse(IResponseModel responseModel) in C:\Users\Chris\Downloads\LinkCrawler-develop\LinkCrawler\LinkCrawler\LinkCrawler.cs:line 59
       at LinkCrawler.LinkCrawler.<>c__DisplayClass31_0.<SendRequest>b__0(IRestResponse response) in C:\Users\Chris\Downloads\LinkCrawler-develop\LinkCrawler\LinkCrawler\LinkCrawler.cs:line 53
       at RestSharp.RestClientExtensions.<>c__DisplayClass1.<ExecuteAsync>b__0(IRestResponse response, RestRequestAsyncHandle handle)
       at RestSharp.RestClient.ProcessResponse(IRestRequest request, HttpResponse httpResponse, RestRequestAsyncHandle asyncHandle, Action`2 callback)
       at RestSharp.RestClient.<>c__DisplayClass3.<ExecuteAsync>b__0(HttpResponse r)
       at RestSharp.Http.ExecuteCallback(HttpResponse response, Action`1 callback)
       at RestSharp.Http.<>c__DisplayClass15.<ResponseCallback>b__13(HttpWebResponse webResponse)
       at RestSharp.Http.GetRawResponseAsync(IAsyncResult result, Action`1 callback)
       at RestSharp.Http.ResponseCallback(IAsyncResult result, Action`1 callback)
  InnerException: 
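
The exception message itself points at the likely fix: the CSV StreamWriter is being written to from multiple response callbacks at once. A minimal sketch of one remedy, using TextWriter.Synchronized as the message suggests (the class shown is a simplified stand-in, not the project's actual CsvOutput):

using System.IO;

public class SynchronizedCsvWriter
{
    private readonly TextWriter _writer;

    public SynchronizedCsvWriter(string filePath, bool overwrite)
    {
        // TextWriter.Synchronized returns a wrapper that serializes all writes,
        // so concurrent response callbacks no longer race on the same stream.
        _writer = TextWriter.Synchronized(new StreamWriter(filePath, append: !overwrite));
    }

    public void WriteLine(string line)
    {
        _writer.WriteLine(line);
    }
}

Wrapping the writes in a lock inside CsvOutput would achieve the same thing.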

SMTP support

Add support for emailing an aggregated report using SMTP. See #12 for discussion around specifics of what would be included etc., but general gist is:

  • Crawl a site completely
  • Create an aggregated log of all broken links found
  • Email results list over SMTP

simples :octocat:

(I'll pick this up along with others)
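
A rough sketch of the emailing step using System.Net.Mail, assuming hypothetical Smtp.Host and Smtp.Recipient app settings (the key names and addresses are illustrative):

using System.Configuration;
using System.Net.Mail;

public static class SmtpReporter
{
    // Sketch: send the aggregated broken-links report once the crawl is complete.
    public static void SendReport(string reportBody)
    {
        var host = ConfigurationManager.AppSettings["Smtp.Host"];           // hypothetical key
        var recipient = ConfigurationManager.AppSettings["Smtp.Recipient"]; // hypothetical key

        using (var client = new SmtpClient(host))
        using (var message = new MailMessage("linkcrawler@example.com", recipient))
        {
            message.Subject = "LinkCrawler: broken links report";
            message.Body = reportBody;
            client.Send(message);
        }
    }
}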

Define (in app.config) what http-statuscodes to treat as success

Right now all requests that are not 1xx or 2xx are treated as failed. It could be useful to create a filter for this; maybe you don't want HTTP status 302 (temporary redirect) to be reported as an error.
This is something that could be configurable in app.config:
<add key="SuccessHttpStatusCodes" value="1xx,2xx,302,303" />
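
A minimal sketch of how such a value might be interpreted (the helper is illustrative; the project's actual implementation may differ):

using System;
using System.Linq;

public static class StatusCodeFilter
{
    // Sketch: decide whether a status code counts as success based on a
    // comma-separated list like "1xx,2xx,302,303".
    public static bool IsSuccess(int statusCode, string configuredCodes)
    {
        return configuredCodes
            .Split(',')
            .Select(c => c.Trim())
            .Any(c => c.EndsWith("xx", StringComparison.OrdinalIgnoreCase)
                ? statusCode / 100 == int.Parse(c.Substring(0, 1))   // "2xx" matches 200-299
                : c == statusCode.ToString());
    }
}

// Example: IsSuccess(302, "1xx,2xx,302,303") returns true,
//          IsSuccess(404, "1xx,2xx,302,303") returns false.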

Improve summary to include breakdown of link counts by status

Now that we've got the option to output the elapsed time at the end of processing (PR #21), it might be good to offer a summary table too. This could list the different statuses alongside the counts of links for each. E.g.:

Status Links
200 113
301 12
302 6
404 3
418 1

To achieve this we could turn the two lists of strings into a list of objects that contain the link URL, a bool for whether or not we've processed the response yet, and a field for the status code.

Proposed subtasks:

  • Replace the two string lists with a list of instances of a new class
  • Include a summary table when outputting the time elapsed

Thoughts very welcome.
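
A hedged sketch of what such a class and the summary grouping might look like (the type and member names are illustrative, not the project's actual code):

using System.Collections.Generic;
using System.Linq;

// Sketch: one object per link, replacing the two plain string lists.
public class CrawledLink
{
    public string Url { get; set; }
    public bool ResponseProcessed { get; set; }
    public int StatusCode { get; set; }
}

public static class SummaryBuilder
{
    // Sketch: group processed links by status code for the summary table.
    public static IEnumerable<string> BuildSummary(IEnumerable<CrawledLink> links)
    {
        return links
            .Where(l => l.ResponseProcessed)
            .GroupBy(l => l.StatusCode)
            .OrderBy(g => g.Key)
            .Select(g => string.Format("{0}\t{1}", g.Key, g.Count()));
    }
}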

Local links

Hello!

I like this project and I would like to collaborate. But I just crawled a website with LinkCrawler and it only shows links with http; it doesn't look into local links like a href="html_images.html" or similar.

Is that intended, or will it be developed in the future? And what happens with deep linking?

Regards
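
If local links are to be checked, they first have to be resolved against the page they were found on. A minimal sketch of that resolution using System.Uri (how it would be wired into LinkCrawler is not shown):

using System;

public static class LinkResolver
{
    // Sketch: turn a relative href into an absolute URL that can be requested.
    public static string Resolve(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        // new Uri(baseUri, relative) handles "html_images.html", "/page.html", "../x" etc.
        return new Uri(baseUri, href).AbsoluteUri;
    }
}

// Example: Resolve("http://example.com/docs/", "html_images.html")
//          -> "http://example.com/docs/html_images.html"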

local links shown as missing

Don't know if this is a bug or a feature, but should local links be shown with status 0? E.g. links that are missing the base URL (/page.html)? I assume they're also not crawled afterwards.

Moved URLs aren't crawled

A URL that returns a 301 (moved permanently) doesn't get crawled afterwards.
I seem to be having a bunch of those, where URLs not ending in a slash give a 301 to the page with the trailing slash.

I'm guessing this is not intended.
If it isn't, let me know; I have some code ready that fixes this, though it needs some cleanup and perhaps some unit tests.

Broken links on Readme

It looks like the example links on the readme are broken.
Things like "Example run with console output" go to a page that says "Cannot proxy the given URL".

Thank you :)

Support sites with login or with age gate

As an extra feature, I think it would be beneficial if this supported sites that block the first page with an age gate, or that have a login (where most of the content is only available after logging in).

Is this even working?

After building and running, I only see this on the console:

0 0 https://github.com
Referer:

I've been waiting for half an hour and nothing else has shown up.
I tried both the develop and master branches.

How do I use this tool?
