
voyager's Introduction

voyager

github crates.io docs.rs build status

With voyager you can easily extract structured data from websites.

Write your own crawler/scraper with voyager following a state machine model.

Example

The examples use tokio as their runtime and futures for the StreamExt trait, so your Cargo.toml could look like this:

[dependencies]
voyager = { version = "0.1" }
futures = "0.3"
tokio = { version = "1", features = ["full"] }

Declare your own Scraper and model

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
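
The full example constructs the scraper with HackernewsScraper::default(), so a Default implementation providing the selectors is also needed. A minimal sketch, using the selectors from the Hacker News example (as an issue further down this page notes, the exact CSS may need adjusting to the current markup):

use voyager::scraper::Selector;

impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
            // CSS selectors for the front page and item pages
            post_selector: Selector::parse("#hnmain tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse(".comment-tree tr.athing").unwrap(),
            max_page: 1,
        }
    }
}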

Implement the voyager::Scraper trait

A Scraper consists of two associated types:

  • Output, the type the scraper eventually produces
  • State, a type the scraper can carry along across the several requests that eventually lead to an Output

and the scrape callback, which is invoked after each received response.

Based on the state attached to the response, you can supply the crawler with new urls to visit, with or without a state attached.

Scraping is done with causal-agent/scraper.

impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }

                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry))
                }
            }
        }

        Ok(None)
    }
}
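
The HackernewsState::Post arm above elides the actual field extraction. One possible way to fill it in with the selectors declared earlier, shown purely as a sketch (the selectors and field mapping are illustrative and may not match the current Hacker News markup):

HackernewsState::Post => {
    // first title link on the item page
    let title_el = html.select(&self.title_selector).next();
    let entry = Entry {
        author: html
            .select(&self.author_selector)
            .next()
            .map(|el| el.text().collect())
            .unwrap_or_default(),
        // the url of the post page itself
        url: response.response_url.clone(),
        // the external link the post points to, if any
        link: title_el.and_then(|el| el.value().attr("href").map(String::from)),
        title: title_el.map(|el| el.text().collect()).unwrap_or_default(),
    };
    return Ok(Some(entry));
}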

Setup and collect all the output

Configure the crawler via CrawlerConfig:

  • Allow/block lists of domains
  • Delays between requests
  • Whether to respect robots.txt rules

Feed your config and an instance of your scraper to the Collector that drives the Crawler and forwards the responses to your Scraper.

use voyager::scraper::Selector;
use voyager::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );
    
    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }
    
    Ok(())
}

See examples for more.

Inject async calls

Sometimes it might be helpful to execute some other calls first, e.g. to fetch a token. You can submit async closures to the crawler to manually get a response and inject a state, or to drive a state to completion.

fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {

    // inject your custom crawl function that produces a `reqwest::Response` and `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth endpoint").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });
    
    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output{/*..*/};
        Ok(Some(output))
    });
    
    Ok(None)
}

Recover a state that got lost

If the crawler encounters an error, e.g. due to a failed or disallowed http request, it is reported as a CrawlError, which carries the last valid state. The error can then be downcast.

let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
  match output {
    Ok(post) => {/**/}
    Err(err) => {
      // recover the state by downcasting the error
      if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
        let last_state = err.state();
      }
    }
  }
}

Licensed under either of these:

voyager's People

Contributors

bastienriviereocd, mattsse, steventrouble, utterstep

voyager's Issues

scrape function never gets called

Hi there! I'm new to Rust, so please bear with me... I'm writing a small crawler, but the scrape function never actually gets called... I tried to use rust-lldb and the collector seems to queue the request, but the scrape function never gets called after that and the program just exits... Can you please point me in the right direction?

Multiple requests at the same time

Hi, is there any way to make the requests parallel and to limit them?
I want to make a scraper for an image-based website, but the CDN is pretty slow compared to the site itself. Is there any way to implement the scraper so that it downloads the images in parallel, since they are slow to download?
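
Not answered in this thread, but the configuration in the "go through all pages" issue further down this page uses max_concurrent_requests, described there as the maximum number of active requests. A minimal sketch assuming that is the relevant knob (the CDN domain is a placeholder):

let config = CrawlerConfig::default()
    // also allow the (hypothetical) image CDN domain
    .allow_domain("img.example-cdn.com")
    // cap the number of requests that are in flight at the same time
    .max_concurrent_requests(8);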

Handling of rate-limiting websites

I have a case where a webpage is behind Cloudflare, and after a number of requests I get a 429 error. The response contains a Retry-After header with the number of seconds to wait before sending the next request.

Currently, is there a way to handle this?
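
Nothing shown on this page reads the Retry-After header automatically. The closest confirmed option here is a fixed per-domain delay between requests, as in the main example above; a minimal sketch (the domain is a placeholder):

let config = CrawlerConfig::default().allow_domain_with_delay(
    "rate-limited.example.com",
    // wait a fixed amount of time between requests to this domain
    RequestDelay::Fixed(std::time::Duration::from_secs(5)),
);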

Hackernews example doesn't work out of the box

This selector does not work correctly:

impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
-           post_selector: Selector::parse("table.itemlist tr.athing").unwrap(),
+           post_selector: Selector::parse("#hnmain tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse(".comment-tree tr.athing").unwrap(),
            max_page: 1,
        }
    }
}

This edit helps. I didn't dive deep enough into the structure of the document to say that this always works, but if the edit is wanted I can open a PR with proper testing.

[Question] go through all pages on a site?

Hi,

I'm trying to use voyager to do some testing while developing a website, and I basically want it to go through all the unique pages/links that exist on that domain. The way I have done it seems to somewhat work, but it becomes extremely slow quite fast, far slower than one would expect, so I believe I'm doing something wrong.

Here is a sample of the code:

use anyhow::Result;
use futures::StreamExt;
use reqwest::{ClientBuilder, Url};
use std::collections::{HashMap, HashSet};
use voyager::scraper::Selector;
use voyager::{Collector, Crawler, CrawlerConfig, Response, Scraper};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    struct Explorer {
        /// visited urls mapped with all the urls that link to that url
        visited: HashMap<Url, HashSet<Url>>,
        link_selector: Selector,
    }
    impl Default for Explorer {
        fn default() -> Self {
            Self {
                visited: Default::default(),
                link_selector: Selector::parse("a").unwrap(),
            }
        }
    }

    impl Scraper for Explorer {
        type Output = (usize, Url);
        type State = Url;

        fn scrape(
            &mut self,
            mut response: Response<Self::State>,
            crawler: &mut Crawler<Self>,
        ) -> Result<Option<Self::Output>> {
            if let Some(origin) = response.state.take() {
                if self.visited.contains_key(&response.response_url) {
                    // println!("already visited: {}", &response.response_url);
                    return Ok(None);
                }
                self.visited
                    .entry(response.response_url.clone())
                    .or_default()
                    .insert(origin);
            }

            for link in response.html().select(&self.link_selector) {
                if let Some(href) = link.value().attr("href") {
                    if let Ok(url) = response.response_url.join(href) {
                        if url.fragment().is_some() {
                            // println!("{}", url.query().unwrap());
                            // crawler.visit(url);
                            continue;
                        } else {
                            crawler.visit_with_state(url, response.response_url.clone());
                        }
                    }
                }
            }

            Ok(Some((response.depth, response.response_url)))
        }
    }

    let config = CrawlerConfig::default()
        // stop after 20 jumps
        .max_depth(20)
        .set_client(
            ClientBuilder::new()
                .danger_accept_invalid_certs(true)
                .https_only(true)
                .connect_timeout(std::time::Duration::from_secs(10))
                .build()
                .unwrap(),
        )
        .allow_domain("localhost")
        // maximum of requests that are active
        .max_concurrent_requests(5);
    let mut collector = Collector::new(Explorer::default(), config);

    collector.crawler_mut().visit("https://localhost");

    while let Some(output) = collector.next().await {
        if let Err(e) = &output {
            // println!("{}", e);
        }
        if let Ok((depth, url)) = output {
            println!("Visited {} at depth: {}", url, depth);
        }
    }

    Ok(())
}

It seems like a lot of the time it is skipping pages/links (as I have told it to when it has already visited them), but it is slower than one would expect, and if I add a delay to avoid overloading my webserver it takes forever. I can also see a lot of load even when no unique site is logged, so it seems it still requests pages it has already visited.
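
One likely cause, going by the code above: the visited check runs inside scrape, i.e. after the page has already been downloaded, so previously seen URLs still get requested. A sketch of skipping them before they are queued; the extra queued set is hypothetical and not part of the original code:

// add to Explorer: queued: HashSet<Url>,
for link in response.html().select(&self.link_selector) {
    if let Some(href) = link.value().attr("href") {
        if let Ok(url) = response.response_url.join(href) {
            if url.fragment().is_some() {
                continue;
            }
            // only queue URLs that were neither visited nor queued before
            if !self.visited.contains_key(&url) && self.queued.insert(url.clone()) {
                crawler.visit_with_state(url, response.response_url.clone());
            }
        }
    }
}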

`thirtyfour` integration

Would you consider integrating thirtyfour for when CSS isn't enough?

I don't need it for anything I'm currently working on but I'd like to take a stab at it
