caustic
portable scraper templates for mobile apps
getting started
The easiest way to try out caustic is the precompiled utility. Run
$ ./caustic '{"load":"http://www.google.com","then":{"find":"Feeling\\s[\\w]*","name":"Feeling?"}}'
in the terminal of your choice. This executes the JSON instruction
{
"load" : "http://www.google.com",
"then" : {
"find" : "Feeling\\s[\\w]*",
"name" : "Feeling?"
}
}
and sends the results to stdout
scope | source | name | value |
---|---|---|---|
1 | 0 | Feeling? | Feeling Lucky |
2 | 0 | Feeling? | Feeling Lucky |
First, caustic loads the URL in load. Then it looks for the regular expression in find, and saves all matches.
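To make the find step concrete, here is a minimal Python sketch of what "saves all matches" means for the pattern above. It is not caustic's own code: `find_all` is a hypothetical helper, the page body is a hard-coded stand-in rather than a live fetch, and Python's `re` module is assumed to treat this particular pattern the same way caustic's regex engine does.

```python
import re

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Hypothetical stand-in for the page body; a real run would fetch the URL in "load".
sample_page = "<a>I'm Feeling Lucky</a> ... <input value=\"I'm Feeling Lucky\">"

# The JSON string "Feeling\\s[\\w]*" decodes to the regex Feeling\s[\w]*
matches = find_all(r"Feeling\s[\w]*", sample_page)
print(matches)
```

Each match becomes its own result row, which is why the table above has two rows for a single find.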
the instruction format
Caustic instructions are logic-free JSON objects that serve as templates for scraping data. By default, text inside double-curlies {{}} is replaced by a substitution, much like mustache.
All caustic instructions are built from finds and loads.
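The double-curly substitution described above can be sketched in a few lines of Python. This is an illustration under assumptions, not caustic's implementation: `substitute` and `PLACEHOLDER` are hypothetical names, and the sketch does no URL-encoding or regex-escaping of the substituted values.

```python
import re

# Matches {{name}} and captures the name between the curlies.
PLACEHOLDER = re.compile(r"\{\{(.+?)\}\}")

def substitute(template, values):
    """Replace each {{name}} in template with values[name].

    Returns (result, missing). If any name has no value yet, result is
    None and missing lists the names that are still needed.
    """
    missing = [n for n in PLACEHOLDER.findall(template) if n not in values]
    if missing:
        return None, missing
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template), []

print(substitute("http://www.google.com/search?q={{query}}", {"query": "hello"}))
print(substitute("http://www.google.com/search?q={{query}}", {}))
```

Reporting the missing names, rather than failing outright, mirrors the idea that an instruction simply waits until a value is available.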
Here's a simple instruction, which is one of the demos:
{
"load" : "http://www.google.com/search?q={{query}}",
"then" : {
"find" : "{{query}}\\s+(\\w+)",
"replace" : "I say '$1'!",
"name" : "what do you say after '{{query}}'?"
}
}
For caustic to execute this instruction, it needs a value to substitute for {{query}}. Run the following
$ ./caustic demos/simple-google.json --input="query=hello"
to replace {{query}} with hello. We get the following
scope | source | name | value |
---|---|---|---|
0 | | query | hello |
1 | 0 | what do you say after 'hello'? | I say 'kitty'! |
2 | 0 | what do you say after 'hello'? | I say 'lyrics'! |
3 | 0 | what do you say after 'hello'? | I say 'lionel'! |
4 | 0 | what do you say after 'hello'? | I say 'kitty'! |
5 | 0 | what do you say after 'hello'? | I say 'beyonce'! |
6 | 0 | what do you say after 'hello'? | I say 'beyonce'! |
7 | 0 | what do you say after 'hello'? | I say 'glee'! |
8 | 0 | what do you say after 'hello'? | I say 'movie'! |
Not only is Google queried for hello, but the substitution also applies to the find's name and replace.
We can also see that find can match multiple times.
We can use backreferences from $0 to $9 in replace.
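As a rough illustration of how $0 to $9 could be expanded, the Python sketch below applies a replace string to each match of a find. `apply_replace` is a hypothetical helper, and the sample text is made up; caustic's actual expansion rules (for example, for groups that did not participate in the match) may differ.

```python
import re

def apply_replace(match, replace):
    """Expand $0..$9 in replace using the groups of a regex match.

    $0 is the whole match; $1..$9 are the capture groups.
    """
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))), replace)

# Hypothetical page text standing in for a search result page.
text = "hello kitty, hello lyrics"
results = [
    apply_replace(m, "I say '$1'!")
    for m in re.finditer(r"hello\s+(\w+)", text)
]
print(results)
```

Because the replace string is applied once per match, a single find with several matches yields several differently formatted values.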
advanced substitutions
Substitutions are a powerful tool because the set of available values grows over the course of execution. Any name that appears in curlies will be substituted once a value has been found for it.
This demo
{
"load" : "http://www.google.com/search?q={{query}}",
"then" : {
"find" : "{{query}}\\s+(\\w+)",
"replace" : "$1",
"name" : "after",
"then" : {
"load" : "http://www.google.com/search?q={{after}}",
"then" : {
"find" : "{{after}}\\s+(\\w+)",
"replace" : "I say '$1'!",
"name" : "what do you say after '{{after}}'?"
}
}
}
}
takes advantage of dynamic substitution, along with the ability to place any number of load or find instructions inside then. It launches a whole new series of queries!
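The interplay of nested then blocks and dynamic substitution can be sketched with a toy interpreter. Everything here is an assumption made for illustration: `run`, `fill`, and the `fetch` callback are hypothetical names, the pages are canned strings instead of real search results, string references inside then are not handled, and the inner find in this trimmed instruction matches on {{after}}.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(.+?)\}\}")

def fill(template, variables):
    """Substitute {{name}} placeholders from variables."""
    return PLACEHOLDER.sub(lambda m: variables[m.group(1)], template)

def run(instruction, variables, fetch, page=""):
    """Execute one instruction; return a list of (name, value) results.

    fetch is a caller-supplied function mapping a URL to page text.
    """
    results = []
    if "load" in instruction:
        # A load fetches a page and passes it to every child in "then".
        scopes = [(variables, fetch(fill(instruction["load"], variables)))]
    else:
        # A find opens one child scope per match and binds name to the value.
        name = fill(instruction["name"], variables)
        pattern = fill(instruction["find"], variables)
        replace = fill(instruction.get("replace", "$0"), variables)
        scopes = []
        for match in re.finditer(pattern, page):
            value = re.sub(r"\$(\d)", lambda r: match.group(int(r.group(1))), replace)
            results.append((name, value))
            scopes.append(({**variables, name: value}, value))
    children = instruction.get("then", [])
    if isinstance(children, dict):
        children = [children]  # "then" may hold one instruction or several
    for child_vars, child_page in scopes:
        for child in children:
            results.extend(run(child, child_vars, fetch, child_page))
    return results

# Canned pages standing in for real search results (made up for this sketch).
pages = {
    "http://www.google.com/search?q=hello": "hello kitty hello lyrics",
    "http://www.google.com/search?q=kitty": "kitty cat",
    "http://www.google.com/search?q=lyrics": "lyrics search",
}
instruction = {
    "load": "http://www.google.com/search?q={{query}}",
    "then": {
        "find": r"{{query}}\s+(\w+)", "replace": "$1", "name": "after",
        "then": {
            "load": "http://www.google.com/search?q={{after}}",
            "then": {
                "find": r"{{after}}\s+(\w+)", "replace": "I say '$1'!",
                "name": "what do you say after '{{after}}'?",
            },
        },
    },
}
results = run(instruction, {"query": "hello"}, lambda url: pages[url])
for name, value in results:
    print(name, "|", value)
```

Each match of the outer find binds a new value for after, and that value is what lets the inner load and find be substituted and executed.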
Try it with
$ ./caustic demos/complex-google.json --input="query=hello"
This produces several dozen rows; here are some highlights:
scope | source | name | value |
---|---|---|---|
48 | 14 | what do you say after 'beyonce'? | I say 'wedding'! |
49 | 14 | what do you say after 'beyonce'? | I say 'songs'! |
50 | 14 | what do you say after 'beyonce'? | I say 'youtube'! |
51 | 14 | what do you say after 'beyonce'? | I say 'jay'! |
52 | 14 | what do you say after 'beyonce'? | I say 'diet'! |
53 | 14 | what do you say after 'beyonce'? | I say 'albums'! |
54 | 14 | what do you say after 'beyonce'? | I say 'biography'! |
55 | 14 | what do you say after 'beyonce'? | I say 'lyrics'! |
56 | 15 | what do you say after 'glee'? | I say 'episodes'! |
57 | 15 | what do you say after 'glee'? | I say 'tv'! |
58 | 15 | what do you say after 'glee'? | I say 'spoilers'! |
59 | 15 | what do you say after 'glee'? | I say 'songs'! |
60 | 15 | what do you say after 'glee'? | I say 'soundtrack'! |
61 | 15 | what do you say after 'glee'? | I say 'cast'! |
62 | 15 | what do you say after 'glee'? | I say 'wiki'! |
63 | 16 | what do you say after 'movie'? | I say 'download'! |
Note that the source column links each find result back to the scope it inherits from.
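One way to read the scope and source columns is as a parent pointer: to resolve a name in a scope, check that scope, then follow source upward. The Python sketch below illustrates that reading using rows from the simple demo's output. `lookup` is a hypothetical helper, and treating scope 0 as a root scope with no source is an assumption.

```python
# Rows taken from the simple demo's output: (scope, source, name, value).
# Scope 0 is assumed to be the root scope holding the --input value.
rows = [
    (0, None, "query", "hello"),
    (1, 0, "what do you say after 'hello'?", "I say 'kitty'!"),
    (2, 0, "what do you say after 'hello'?", "I say 'lyrics'!"),
]

def lookup(rows, scope, name):
    """Find name in scope, falling back to each ancestor scope via source."""
    by_scope = {row[0]: row for row in rows}
    while scope is not None:
        _, source, row_name, value = by_scope[scope]
        if row_name == name:
            return value
        scope = source  # climb to the scope this one inherits from
    return None

print(lookup(rows, 2, "query"))
```

Under this reading, a value found in one scope is visible to every scope descended from it, which is how {{query}} stays available deep inside nested instructions.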
references
You probably noticed that the interior portion of the last demo was basically copy-and-pasted from the demo before it. Wouldn't it be nice if we could reuse instruction components?
This demo does just that
{
"load" : "http://www.google.com/search?q={{query}}",
"then" : {
"find" : "{{query}}\\s+(\\w+)",
"replace" : "$1",
"name" : "after",
"then" : "simple-google.json"
}
}
Running
$ ./caustic demos/complex-google.json --input="query=hello"
should give you the same results as before. Any string appearing inside then will be evaluated as a reference.
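A reference such as "simple-google.json" has to be located somehow. The Python sketch below assumes, without confirmation from caustic itself, that a relative reference resolves against the location of the file that contains it, the way relative URLs do. `resolve_then` is a hypothetical helper, and special values such as $this are not handled here.

```python
from urllib.parse import urljoin

def resolve_then(then, base_uri):
    """Classify a "then" value; resolve string references against base_uri."""
    if isinstance(then, str):
        # Assumption: references are resolved like relative URLs.
        return ("reference", urljoin(base_uri, then))
    if isinstance(then, dict):
        return ("inline", then)
    raise TypeError("unsupported then value: %r" % (then,))

print(resolve_then("simple-google.json", "demos/complex-google.json"))
print(resolve_then({"find": "x"}, "demos/complex-google.json"))
```

An absolute URL passed as the reference is returned unchanged by `urljoin`, so the same helper would cover references to remote files.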
remote templates
Templates can be accessed remotely. Running
$ ./caustic https://github.com/talos/caustic/blob/master/demos/simple-google.json --input="query=hello"
runs the first demo. References can be remote too, even when the
referring file is local. The prior demo works the same if you change then
to read
https://github.com/talos/caustic/blob/master/demos/simple-google.json
recursion
What if you want a scraper to run itself? No problem:
{
"load" : "http://www.google.com/search?q={{query}}",
"then" : {
"find" : "{{query}}\\s+(\\w+)",
"replace" : "$1",
"name" : "query",
"then" : "$this"
}
}
When it appears inside then, $this evaluates to the entire object. The evaluation happens only when then is executed.
Remember that
$ ./caustic demos/recursive-google.json --input="query=hello"
will not stop on its own!
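To see why the expansion never ends, and how a caller might bound it, here is a small Python sketch that walks the structure of the recursive instruction. It makes several assumptions: `walk` and `max_depth` are hypothetical and are not caustic options, $this is taken to mean the top-level instruction, and the sketch follows structure only, without loading pages or branching per match.

```python
def walk(node, root, depth=0, max_depth=3):
    """Yield (depth, kind) per instruction visited, expanding "$this" lazily."""
    if node == "$this":
        if depth >= max_depth:
            return  # hypothetical safeguard; caustic itself keeps going
        # Only now, when "then" is reached, is $this replaced by the root.
        node, depth = root, depth + 1
    yield depth, "load" if "load" in node else "find"
    if "then" in node:
        yield from walk(node["then"], root, depth, max_depth)

recursive = {
    "load": "http://www.google.com/search?q={{query}}",
    "then": {
        "find": r"{{query}}\s+(\w+)",
        "replace": "$1",
        "name": "query",
        "then": "$this",
    },
}
visited = list(walk(recursive, recursive, max_depth=2))
print(visited)
```

Because $this is expanded only when it is reached, the instruction stays finite on disk even though its execution is unbounded; removing the depth check in this sketch would make the walk run until Python's recursion limit is hit.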
Why?
Caustic is designed to give wider access to obscure public data. The caustic format makes it easy to quickly design and test a scraper that extracts a few pieces of information from behind several layers of obfuscation.