
uniparser's Issues

Refactor the parse rule: ignore a node's value if it has a child

Parse result

{
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT',
            'rule3': 'dlrow olleh si sihT'
        }
    }
}

should change into

{'parse_rule': {'rule1': {'rule2': 'dlrow olleh si sihT', 'rule3': 'hello world'}}}
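
A minimal sketch of the conversion, assuming every level that has a `__child__` entry simply drops its own value and takes the collapsed child instead. The `collapse` helper and the `CHILD_KEY` constant are hypothetical names for illustration, not part of uniparser.

```python
CHILD_KEY = '__child__'


def collapse(result):
    """Replace each node value with its collapsed child, if a child exists."""
    if CHILD_KEY not in result:
        # Leaf level: keep the parsed values as they are.
        return dict(result)
    child = collapse(result[CHILD_KEY])
    # Every sibling key at this level points at the collapsed child.
    return {key: child for key in result if key != CHILD_KEY}


old_result = {
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT',
            'rule3': 'dlrow olleh si sihT'
        }
    }
}
new_result = collapse(old_result)
```

With this sketch the node values 'This is article title' and 'This is hello world' are discarded, which is the behavior the issue title asks for.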

Documentation

Constraints on schema variables

Flow chart describing the relationships between rules

Add rule finder for url => rule

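import re
from urllib.parse import urlparse


def find_rule(url, storage):
    # Look up the candidate rules by host, then return the first regex match.
    host = urlparse(url).netloc
    rules = storage.get(host, [])
    for rule in rules:
        if re.match(rule.regex_str, url):
            return rule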

storage could be Redis, a dict, MySQL, etc.
key: host
value: rules_to_json()
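
A rough sketch of the lookup with a plain dict as storage, assuming the value stored per host is a JSON-encoded list of rules. The rule fields `name` and `regex_str` and the sample patterns are assumptions for illustration; a Redis or MySQL backend would only need to offer the same `get(host)` behavior.

```python
import json
import re
from urllib.parse import urlparse

# Hypothetical storage: host -> JSON string holding that host's rules.
storage = {
    'httpbin.org': json.dumps([
        {'name': 'forms', 'regex_str': r'https?://httpbin\.org/forms/.*'},
        {'name': 'get', 'regex_str': r'https?://httpbin\.org/get'},
    ]),
}


def find_rule(url, storage):
    """Return the first stored rule whose regex matches the URL, else None."""
    host = urlparse(url).netloc
    raw = storage.get(host)
    if raw is None:
        return None
    for rule in json.loads(raw):
        if re.match(rule['regex_str'], url):
            return rule
    return None


matched = find_rule('http://httpbin.org/forms/post', storage)
missing = find_rule('http://example.com/', storage)
```

Keying by host keeps the number of regexes tried per URL small, since only the rules registered for that host are checked.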

Test the performance of three JSON parsers

from jmespath import compile as jc, search as js
from jsonpath_ng import parse
from objectpath import Tree
import timeit

JSON = {'a': {'a': 'a'}}

cc = jc('a.a')
pp = parse('$.a.a')
t = Tree(JSON)


def test1():
    "jmespath compiled"
    return cc.search(JSON)


def test2():
    "jmespath uncompiled"
    return js('a.a', JSON)


def test3():
    "jsonpath_ng compiled"
    return [i.value for i in pp.find(JSON)]


def test4():
    "jsonpath_ng uncompiled"
    return [i.value for i in parse('$.a.a').find(JSON)]


def test5():
    "objectpath compiled"
    return t.execute('$.a.a')


def test6():
    "objectpath uncompiled"
    t = Tree(JSON)
    return t.execute('$.a.a')


num = 100000

print(test1.__doc__, ':', round(
    timeit.timeit(test1, number=num) * 1000 / num, 3), 'ms')
print(test2.__doc__, ':', round(
    timeit.timeit(test2, number=num) * 1000 / num, 3), 'ms')
print(test3.__doc__, ':', round(
    timeit.timeit(test3, number=num) * 1000 / num, 3), 'ms')
print(test4.__doc__, ':',
      round(timeit.timeit(test4, number=1000) * 1000 / 1000, 3), 'ms')
print(test5.__doc__, ':', round(
    timeit.timeit(test5, number=num) * 1000 / num, 3), 'ms')
print(test6.__doc__, ':', round(
    timeit.timeit(test6, number=num) * 1000 / num, 3), 'ms')
# jmespath compiled : 0.007 ms
# jmespath uncompiled : 0.009 ms
# jsonpath_ng compiled : 0.011 ms
# jsonpath_ng uncompiled : 12.536 ms
# objectpath compiled : 0.022 ms
# objectpath uncompiled : 0.021 ms

Create a class for the parse result

result_dict = {
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT'
        }
    }
}

# Desired attribute-style access on a result object built from result_dict:
rule2 = result_obj.parse_rule.rule1.rule2
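
One possible shape for such a class, as a sketch only. `ParseResult` is a hypothetical name, and the sketch assumes that a node with a `__child__` entry exposes the child instead of its own value.

```python
CHILD_KEY = '__child__'


class ParseResult:
    """Thin wrapper giving attribute access to a nested parse result dict."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name == CHILD_KEY or name not in self._data:
            raise AttributeError(name)
        if CHILD_KEY in self._data:
            # The node has a child: step into it instead of returning the value.
            return ParseResult(self._data[CHILD_KEY])
        return self._data[name]


result_dict = {
    'parse_rule': 'This is article title',
    '__child__': {
        'rule1': 'This is hello world',
        '__child__': {
            'rule2': 'dlrow olleh si sihT'
        }
    }
}
result_obj = ParseResult(result_dict)
rule2 = result_obj.parse_rule.rule1.rule2
```

Unknown rule names raise `AttributeError`, so a typo fails loudly instead of returning `None`.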

Shortcut to parse a list one by one

input_object:
[html-object, html-object, html-object]

output:
[{'text': 'xxx', 'href': 'xxxx'}, {'text': 'xxx', 'href': 'xxxx'}, {'text': 'xxx', 'href': 'xxxx'}]
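
A small sketch of the idea, using `xml.etree.ElementTree` elements from the standard library as stand-ins for the html-objects. `parse_each` and `field_rules` are hypothetical names, and the extractors are plain callables rather than uniparser rules.

```python
import xml.etree.ElementTree as ET


def parse_each(items, field_rules):
    """Apply every field extractor to each item and collect one dict per item."""
    return [
        {name: extract(item) for name, extract in field_rules.items()}
        for item in items
    ]


# Three stand-in link elements playing the role of the html-objects.
links = [ET.fromstring(f'<a href="/page/{i}">link {i}</a>') for i in range(3)]

field_rules = {
    'text': lambda node: node.text,
    'href': lambda node: node.get('href'),
}
parsed = parse_each(links, field_rules)
```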

Refactor context strategy

Treat context as a container for sharing variables.
If context is None, do not replace it with {} until the parse_chain level.
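
A sketch of one reading of this strategy: upper levels forward `context` untouched, and only `parse_chain` creates the dict when nothing was passed. The signatures, the `parse_with_rule` wrapper, and the two sample steps are assumptions for illustration, not the library's actual API.

```python
def parse_chain(input_object, chain_rules, context=None):
    """Run each step in order; the context dict is only created at this level."""
    if context is None:
        context = {}
    for step in chain_rules:
        input_object = step(input_object, context)
    return input_object


def parse_with_rule(input_object, chain_rules, context=None):
    """Upper level: forward context as-is, even when it is None."""
    return parse_chain(input_object, chain_rules, context=context)


def remember(value, context):
    # Share the incoming value with later steps and with the caller.
    context['seen'] = value
    return value


def upper(value, context):
    return value.upper()


shared = {}
result = parse_with_rule('abc', [remember, upper], context=shared)
```

Because the caller's dict is passed through by reference, values stored by one step stay visible to the caller after parsing.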

orjson dumps: CrawlerRule is not JSON serializable

from uniparser import HostRule
from uniparser.config import GlobalConfig
import orjson


def ordumps(*args, **kwargs):
    return orjson.dumps(*args, **kwargs).decode('utf-8')


GlobalConfig.json_dumps = ordumps

rule = HostRule(
    **{
        'host': 'httpbin.org',
        'crawler_rules': {
            'HelloWorld': {
                'name': 'HelloWorld',
                'parse_rules': [{
                    'name': 'result',
                    'chain_rules': [['css', 'p', '$text'],
                                    ['python', 'getitem', '[0]']],
                    'child_rules': [],
                    'childs': ''
                }],
                'request_args': {
                    'method': 'get',
                    'url': 'http://httpbin.org/forms/post',
                    'headers': {
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
                    }
                },
                'regex': '',
                'encoding': ''
            }
        }
    })
print(rule.dumps())
# orjson.JSONEncodeError: Type is not JSON serializable: CrawlerRule
