clericpy / uniparser
Provides a general low-code parsing solution.
100,000 CrawlerRules use about 400 MB of memory.
For more than 100,000 rules, store them in sqlite / redis / mysql / mongodb.
Parse result
{
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT',
            'rule3': 'dlrow olleh si sihT'
        }
    }
}
change into
{'parse_rule': {'rule1': {'rule2': 'dlrow olleh si sihT', 'rule3': 'hello world'}}}
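The change described above could be sketched roughly as follows. `collapse_children` is a hypothetical helper, not part of uniparser, and it assumes that whenever a level carries a `__child__` entry, the values of its sibling keys are replaced by the collapsed child.

```python
def collapse_children(result):
    """Fold each '__child__' dict into the keys that sit next to it."""
    child = result.get('__child__')
    collapsed = {}
    for key, value in result.items():
        if key == '__child__':
            continue
        # Assumption: a level that has children keeps only the children,
        # so the sibling's own value is dropped.
        collapsed[key] = collapse_children(child) if child is not None else value
    return collapsed


nested = {
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT',
            'rule3': 'dlrow olleh si sihT'
        }
    }
}
print(collapse_children(nested))
```

A leaf level (no `__child__`) is returned unchanged, so the function can be applied to any depth of nesting.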
Parser for unknown input that tries all the matched rules
frequency control
proxy
max_retry
global context
cc = compile('2**30', "", "exec")
Running 2**30 directly in the original environment is 90 times faster than exec(string), but about the same speed as exec(cc).
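A comparison like the one above can be reproduced with `timeit`. This is only a sketch: the function names and the iteration count are assumptions, and the measured ratios will vary by machine and Python version.

```python
import timeit

SOURCE = '2**30'
cc = compile(SOURCE, '', 'exec')  # compile once, reuse the code object


def run_native():
    return 2**30


def run_exec_string():
    exec(SOURCE)  # parses and compiles the string on every call


def run_exec_compiled():
    exec(cc)  # skips parsing, runs the precompiled code object


def benchmark(number=10000):
    """Return total seconds per variant for `number` calls each."""
    variants = (run_native, run_exec_string, run_exec_compiled)
    return {f.__name__: timeit.timeit(f, number=number) for f in variants}


for name, seconds in benchmark().items():
    print(name, ':', round(seconds * 1000, 3), 'ms total')
```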
py -> python
Allow the settings to be modified
Schema variables constraint
flow chart to describe the rule relationship
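import re
from urllib.parse import urlparse

def find_rule(url, storage):
    # Look up the rules stored for this host, then return the first one whose regex matches.
    host = urlparse(url).netloc
    rules = storage.get(host, [])
    for rule in rules:
        if re.match(rule.regex_str, url):
            return rule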
storage could be redis, dict, mysql.......
key: host
value: rules_to_json()
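The host-keyed layout above might look like this with a plain dict standing in for redis or mysql. `save_rules` and the rule fields other than `regex_str` are hypothetical, and rules are stored as JSON text to mirror the `rules_to_json()` value.

```python
import json
import re
from urllib.parse import urlparse

storage = {}  # assumption: a dict stands in for redis / mysql


def save_rules(host, rules):
    # key: host, value: JSON-encoded list of rule dicts
    storage[host] = json.dumps(rules)


def find_rule(url, storage):
    host = urlparse(url).netloc
    raw = storage.get(host)
    if raw is None:
        return None
    for rule in json.loads(raw):
        if re.match(rule['regex_str'], url):
            return rule
    return None


save_rules('httpbin.org', [
    {'name': 'forms', 'regex_str': r'^http://httpbin\.org/forms/'},
])
print(find_rule('http://httpbin.org/forms/post', storage))
```

Keying by host keeps each lookup to one storage read, and the regex scan only covers the rules of that host.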
from jmespath import compile as jc, search as js
from jsonpath_ng import parse
from objectpath import Tree
import timeit

JSON = {'a': {'a': 'a'}}
cc = jc('a.a')
pp = parse('$.a.a')
t = Tree(JSON)


def test1():
    "jmespath compiled"
    return cc.search(JSON)


def test2():
    "jmespath uncompiled"
    return js('a.a', JSON)


def test3():
    "jsonpath_ng compiled"
    return [i.value for i in pp.find(JSON)]


def test4():
    "jsonpath_ng uncompiled"
    return [i.value for i in parse('$.a.a').find(JSON)]


def test5():
    "objectpath compiled"
    return t.execute('$.a.a')


def test6():
    "objectpath uncompiled"
    t = Tree(JSON)
    return t.execute('$.a.a')


num = 100000
print(test1.__doc__, ':', round(
    timeit.timeit(test1, number=num) * 1000 / num, 3), 'ms')
print(test2.__doc__, ':', round(
    timeit.timeit(test2, number=num) * 1000 / num, 3), 'ms')
print(test3.__doc__, ':', round(
    timeit.timeit(test3, number=num) * 1000 / num, 3), 'ms')
print(test4.__doc__, ':',
      round(timeit.timeit(test4, number=1000) * 1000 / 1000, 3), 'ms')
print(test5.__doc__, ':', round(
    timeit.timeit(test5, number=num) * 1000 / num, 3), 'ms')
print(test6.__doc__, ':', round(
    timeit.timeit(test6, number=num) * 1000 / num, 3), 'ms')
# jmespath compiled : 0.007 ms
# jmespath uncompiled : 0.009 ms
# jsonpath_ng compiled : 0.011 ms
# jsonpath_ng uncompiled : 12.536 ms
# objectpath compiled : 0.022 ms
# objectpath uncompiled : 0.021 ms
implement a lazy importer class
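One possible shape for such a class is sketched below. `LazyImporter` is a hypothetical name and design, not the project's implementation; it defers `importlib.import_module` until the first attribute access.

```python
import importlib


class LazyImporter:
    """Proxy that imports the wrapped module on first attribute access."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself,
        # i.e. anything that should come from the real module.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


lazy_json = LazyImporter('json')   # nothing is imported by the proxy yet
print(lazy_json.dumps({'a': 1}))   # first access triggers the import
```

This keeps optional parser dependencies from being loaded, or from failing at import time, unless a rule actually uses them.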
result_dict = {
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT'
        }
    }
}
rule2 = result_obj.parse_rule.rule1.rule2
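Dotted access of this kind could be provided by a thin wrapper over the result dict. `ResultView` is a hypothetical class for illustration, and it assumes that a key with a sibling `__child__` resolves to that child level rather than to its own value.

```python
class ResultView:
    """Attribute-style view over a result dict that nests via '__child__'."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        data = self._data
        if name == '__child__' or name not in data:
            raise AttributeError(name)
        child = data.get('__child__')
        # Assumption: descend into the child level when one exists,
        # otherwise return the leaf value.
        return ResultView(child) if child is not None else data[name]


result_dict = {
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT'
        }
    }
}
result_obj = ResultView(result_dict)
rule2 = result_obj.parse_rule.rule1.rule2
print(rule2)
```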
input_object:
[html-object, html-object, html-object]
output:
[{'text': 'xxx', 'href': 'xxxx'}, {'text': 'xxx', 'href': 'xxxx'}, {'text': 'xxx', 'href': 'xxxx'}]
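The list-in, list-out behaviour above can be sketched with the standard library. Here `xml.etree.ElementTree` elements stand in for the html-objects, and `extract_links` is a hypothetical helper rather than a uniparser function.

```python
import xml.etree.ElementTree as ET


def extract_links(elements):
    """Map each element to a dict with its text and href attribute."""
    return [{'text': el.text, 'href': el.get('href')} for el in elements]


root = ET.fromstring(
    '<div><a href="/a">first</a><a href="/b">second</a></div>')
input_object = root.findall('a')  # list of element objects
print(extract_links(input_object))
```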
Set context like a container for sharing variables
If context is None, do not replace it with {} until the parse_chain level
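A minimal sketch of that idea, assuming hypothetical `parse` and `parse_chain` functions whose signatures are not taken from uniparser: the outer level passes `context` through untouched, and only `parse_chain` falls back to a fresh dict.

```python
def parse_chain(input_object, chain, context=None):
    # The empty dict is created only here, at the innermost level.
    if context is None:
        context = {}
    for step in chain:
        input_object = step(input_object, context)
    return input_object


def parse(input_object, chains, context=None):
    # Pass context through as-is; None stays None until parse_chain.
    return [parse_chain(input_object, chain, context) for chain in chains]


def remember(value, context):
    """Example step: store the value so later steps or chains can see it."""
    context['seen'] = value
    return value


def shout(value, context):
    return value.upper()


shared = {}
print(parse('hello', [[remember, shout]], shared), shared)
```

When the caller supplies a dict, every chain shares it; when the caller supplies nothing, each chain gets its own private container.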
1.5.0 is not stable
using Jsonpath-rw-ext instead
from uniparser import HostRule
from uniparser.config import GlobalConfig
import orjson


def ordumps(*args, **kwargs):
    return orjson.dumps(*args, **kwargs).decode('utf-8')


GlobalConfig.json_dumps = ordumps
rule = HostRule(
    **{
        'host': 'httpbin.org',
        'crawler_rules': {
            'HelloWorld': {
                'name': 'HelloWorld',
                'parse_rules': [{
                    'name': 'result',
                    'chain_rules': [['css', 'p', '$text'],
                                    ['python', 'getitem', '[0]']],
                    'child_rules': [],
                    'childs': ''
                }],
                'request_args': {
                    'method': 'get',
                    'url': 'http://httpbin.org/forms/post',
                    'headers': {
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
                    }
                },
                'regex': '',
                'encoding': ''
            }
        }
    })
print(rule.dumps())
# orjson.JSONEncodeError: Type is not JSON serializable: CrawlerRule
to share variables