caustic

portable scraper templates for mobile apps

getting started


The easiest way to try out caustic is the precompiled utility. Run

$ ./caustic '{"load":"http://www.google.com","then":{"find":"Feeling\\s[\\w]*","name":"Feeling?"}}'

in the terminal of your choice. This executes the JSON instruction

{
  "load"  : "http://www.google.com",
  "then" : {
    "find" : "Feeling\\s[\\w]*",
    "name" : "Feeling?"
  }
}

and sends the results to stdout

scope source name value
1 0 Feeling? Feeling Lucky
2 0 Feeling? Feeling Lucky

First, caustic loads the URL in load. Then it looks for the regular expression in find, and saves all matches.
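The load-then-find flow can be pictured with a few lines of Python. This is only a rough sketch of the idea, not how caustic is implemented: the `page` string is a made-up stand-in for whatever `load` would fetch, and Python's `re` module is assumed to behave closely enough to caustic's regular expressions for this pattern.

```python
import re

# Made-up stand-in for the body that "load" would fetch; no network access here.
page = '<input value="I\'m Feeling Lucky"> <a>Feeling Lucky</a>'

# Same pattern as "find" in the instruction, after JSON unescaping.
pattern = r"Feeling\s[\w]*"

# Every non-overlapping match becomes one result row under the given name.
matches = re.findall(pattern, page)
rows = [("Feeling?", value) for value in matches]

for name, value in rows:
    print(name, value)
```

Each match produces its own row, which is why the output above lists the same value twice.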

the instruction format


Caustic's instructions are logic-free JSON objects that act as dynamic templates for scraping data. By default, text inside double curly braces {{}} is substituted, much like in mustache templates.
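To make the double-curly substitution concrete, here is a minimal Python sketch. The `substitute` function is hypothetical and is not part of caustic. Leaving unknown names untouched is an assumption made for illustration, and URL encoding is not handled.

```python
import re

def substitute(template, values):
    """Replace each {{name}} with values[name]; leave unknown names untouched."""
    def repl(match):
        name = match.group(1)
        # Assumption: a name with no value yet is left as-is.
        return values.get(name, match.group(0))
    return re.sub(r"\{\{(.*?)\}\}", repl, template)

url = substitute("http://www.google.com/search?q={{query}}", {"query": "hello"})
print(url)
```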

All caustic instructions are built from finds and loads.

Here's a simple instruction, which is one of the demos:

{
 "load" : "http://www.google.com/search?q={{query}}",
 "then"  : {
   "find"    : "{{query}}\\s+(\\w+)",
   "replace" : "I say '$1'!",
   "name"    : "what do you say after '{{query}}'?"
 }
}

For caustic to execute this instruction, it needs a value to substitute for {{query}}. Run the following

$ ./caustic demos/simple-google.json --input="query=hello"

to replace {{query}} with hello. We get the following

scope source name value
0 query hello
1 0 what do you say after 'hello'? I say 'kitty'!
2 0 what do you say after 'hello'? I say 'lyrics'!
3 0 what do you say after 'hello'? I say 'lionel'!
4 0 what do you say after 'hello'? I say 'kitty'!
5 0 what do you say after 'hello'? I say 'beyonce'!
6 0 what do you say after 'hello'? I say 'beyonce'!
7 0 what do you say after 'hello'? I say 'glee'!
8 0 what do you say after 'hello'? I say 'movie'!

Not only is Google queried for hello; the substitution also applies to the name and replace fields of the find.

We can also see that find can match multiple times.

We can use backreferences from $0 to $9 in replace.
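The way find and replace combine can be sketched in Python as follows. The helper `apply_replace` is hypothetical, and the sample text is made up. It expands $0 (the whole match) and $1 to $9 (capture groups) once per match, which is the behaviour described above, though caustic's own regex engine may differ in details.

```python
import re

def apply_replace(pattern, replace, text):
    """Return one value per match, expanding $0..$9 in `replace`."""
    results = []
    for m in re.finditer(pattern, text):
        # $0 is the whole match; $1..$9 are capture groups.
        # A group that did not participate in the match expands to "".
        value = re.sub(
            r"\$(\d)",
            lambda ref: m.group(int(ref.group(1))) or "",
            replace,
        )
        results.append(value)
    return results

values = apply_replace(r"hello\s+(\w+)", "I say '$1'!", "hello kitty, hello lyrics")
print(values)
```

Note that this sketch raises an IndexError if `replace` refers to a group number the pattern does not define.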

advanced substitutions


Substitutions are a powerful tool because they develop over the course of execution. Any name that appears in curlies will be substituted once a value has been found for it.

This demo

{
  "load" : "http://www.google.com/search?q={{query}}",
  "then"  : {
    "find"    : "{{query}}\\s+(\\w+)",
    "replace" : "$1",
    "name"    : "after",
    "then" : {
      "load" : "http://www.google.com/search?q={{after}}",
      "then" : {
        "find"    : "{{query}}\\s+(\\w+)",
        "replace" : "I say '$1'!",
        "name"    : "what do you say after '{{after}}'?"
      }
    }
  }
}

takes advantage of dynamic substitution, along with the ability to place any number of load or find instructions inside then. It launches a whole new series of queries!

Try it with

$ ./caustic demos/complex-google.json --input="query=hello"

You'll see that this results in several dozen rows. Here are some highlights:

scope source name value
48 14 what do you say after 'beyonce'? I say 'wedding'!
49 14 what do you say after 'beyonce'? I say 'songs'!
50 14 what do you say after 'beyonce'? I say 'youtube'!
51 14 what do you say after 'beyonce'? I say 'jay'!
52 14 what do you say after 'beyonce'? I say 'diet'!
53 14 what do you say after 'beyonce'? I say 'albums'!
54 14 what do you say after 'beyonce'? I say 'biography'!
55 14 what do you say after 'beyonce'? I say 'lyrics'!
56 15 what do you say after 'glee'? I say 'episodes'!
57 15 what do you say after 'glee'? I say 'tv'!
58 15 what do you say after 'glee'? I say 'spoilers'!
59 15 what do you say after 'glee'? I say 'songs'!
60 15 what do you say after 'glee'? I say 'soundtrack'!
61 15 what do you say after 'glee'? I say 'cast'!
62 15 what do you say after 'glee'? I say 'wiki'!
63 16 what do you say after 'movie'? I say 'download'!

Note that the source column links each find result back to the scope it inherits from.
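One way to read the scope and source columns is as a tree: each row points at its parent, and a scope can see every value on the path back to the root. The Python sketch below illustrates that reading. The rows are made-up toy data, not real caustic output, and `visible_values` is a hypothetical helper.

```python
# Toy rows in (scope, source, name, value) form; invented for illustration.
rows = [
    (0, None, "query", "hello"),
    (1, 0, "after", "kitty"),
    (2, 1, "what do you say after 'kitty'?", "I say 'cat'!"),
]

def visible_values(scope, rows):
    """Collect name -> value visible from `scope` by following source links upward."""
    by_scope = {row[0]: row for row in rows}
    values = {}
    current = scope
    while current is not None:
        _, source, name, value = by_scope[current]
        values.setdefault(name, value)  # nearer scopes win over ancestors
        current = source
    return values

print(visible_values(2, rows))
```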

references


You probably noticed that the interior portion of the last demo was basically copy-and-pasted from the demo before it. Wouldn't it be nice if we could reuse instruction components?

This demo does just that

{
  "load" : "http://www.google.com/search?q={{query}}",
  "then"  : {
    "find"    : "{{query}}\\s+(\\w+)",
    "replace" : "$1",
    "name"    : "after",
    "then"    : "simple-google.json"
  }
}

Running

$ ./caustic demos/complex-google.json --input="query=hello"

should give you the same results as before. Any string appearing inside then will be evaluated as a reference.
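The rule that a string inside then is a reference can be sketched like this in Python. The function `resolve_then` and the `read_text` callback are hypothetical names chosen for the example. The callback stands in for however the referenced file or URL would actually be read, and an in-memory dict plays the role of the file system.

```python
import json

def resolve_then(then, read_text):
    """Return the instruction object that `then` stands for."""
    if isinstance(then, str):
        # A string is treated as a reference: fetch it and parse it as JSON.
        return json.loads(read_text(then))
    return then  # already an inline instruction object

# In-memory stand-in for the demos directory.
files = {"simple-google.json": '{"load": "http://www.google.com/search?q={{query}}"}'}

resolved = resolve_then("simple-google.json", files.__getitem__)
print(resolved["load"])
```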

remote templates


Templates can be accessed remotely. Running

$ ./caustic https://github.com/talos/caustic/blob/master/demos/simple-google.json --input="query=hello"

runs the first demo. References can be remote, too, even if the referring file is local. The prior demo will work the same if you alter then to read https://github.com/talos/caustic/blob/master/demos/simple-google.json

recursion


What if you want a scraper to run itself? No problem:

{
  "load"  : "http://www.google.com/search?q={{query}}",
  "then" : {
    "find"     : "{{query}}\\s+(\\w+)",
    "replace" : "$1",
    "name"   : "query",
    "then"   : "$this"
  }
}

Inside then, $this evaluates to the entire enclosing instruction object. The evaluation is performed only when then runs.

Remember that

$ ./caustic demos/recursive-google.json --input="query=hello"

will not stop on its own!
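To see why the recursion never ends by itself, here is a small Python simulation. Everything in it is an assumption made for illustration: `lookup` is a made-up table standing in for load plus find, and `max_depth` is a guard added for this sketch, not a caustic option.

```python
def run(instruction, query, lookup, max_depth=3, depth=0):
    """Yield successive queries; `lookup` stands in for load + find."""
    if depth >= max_depth:
        return  # guard added for this sketch only
    yield query
    then = instruction["then"]
    # "$this" is resolved only now, when `then` is about to run.
    child = instruction if then["then"] == "$this" else then["then"]
    for word in lookup.get(query, []):
        yield from run(child, word, lookup, max_depth, depth + 1)

instruction = {
    "load": "http://www.google.com/search?q={{query}}",
    "then": {"find": "{{query}}\\s+(\\w+)", "replace": "$1",
             "name": "query", "then": "$this"},
}

# Made-up results that point back at each other, so the chain never runs dry.
lookup = {"hello": ["kitty"], "kitty": ["hello"]}

queries = list(run(instruction, "hello", lookup))
print(queries)
```

Without the depth guard, the two queries would keep feeding each other indefinitely.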

Why?


Caustic is designed to give wider access to obscure public data. The caustic format makes it easy to quickly design and test a scraper that extracts a few pieces of information from behind several layers of obfuscation.
