inca's Issues

`action="batch"` in processor_class

What is the correct way to call INCA's processors with the `batch` action? I'm planning to use multiple processors, and this option would help because it sends HTTP requests to Elasticsearch more efficiently than `run`, which sends one request per doc. However, I get a `name 'bulksize' is not defined` error when I pass `bulksize=100` as an argument to the processor.

  • here's the calling code:
try:
    # raw strings keep the regex escapes (\$, \., ...) from being read as string escapes
    rules_fox = [
        {"regexp": r"\n[A-Z0-9 :,\'!@\$\(\)\-\.\?\:\;]+(?:\n|$)", "replace_with": ""},
        {
            "regexp": r"Get all the latest news on coronavirus and more delivered daily to your inbox\. Sign up here",
            "replace_with": "",
        },
    ]

    # generator
    docs_regexp = myinca.processing.multireplace(
        docs_or_query="foxnews",
        field="article_maintext",
        rules=rules_fox,
        save=True,
        new_key="article_maintext_0",
        action="batch",
        batchsize=100,
    )
    for doc in docs_regexp:
        # runs process on doc
        # doesn't yield updated doc since it saves to db
        pass

except Exception as e:
    LOGGER.warning(e)

  • here's the warning message:
2021-11-02 07:42:55,030 - [INFO] - INCA - (processor_class.py)._doctype_query_or_list(228) - assuming documents of given type should be processed
2021-11-02 07:42:55,030 - [INFO] - INCA - (processor_class.py)._doctype_query_or_list(234) - force=False, ignoring documents where the result key exists (and has non-NULL value)
2021-11-02 07:42:55,030 - [WARNING] - main - (1275500693.py).<module>(27) - name 'bulksize' is not defined
  • I know it's related to this section of processor_class.py, but it'd be helpful if you could help me understand:
  1. how to get bulksize=100 to show up as a kwarg, and
  2. where target_func comes from
    def runwrap(
        self,
        docs_or_query,
        field,
        new_key=None,
        save=False,
        force=False,
        action="run",
        *args,
        **kwargs
    ):

    ...
        elif action == "batch":
            for num, batch in enumerate(_batcher(documents, batchsize=bulksize)):
                core.database.bulk_upsert().run(
                    documents=[
                        target_func.run(
                            document=doc, field=field, force=force, *args, **kwargs
                        )
                        for doc in batch
                    ]
                )
                now = datetime.datetime.now()
                logger.info("processed batch {num} {now}".format(**locals()))
                yield batch
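
For anyone hitting the same error, here is a minimal sketch of what seems to be going on, based only on the snippet above. `bulksize` is read as a bare name inside `runwrap`, but it is not in the signature, so a `bulksize=100` keyword argument would land in `**kwargs` and the name lookup would still fail. The functions below (`runwrap_broken`, `runwrap_patched`, and this stand-in `_batcher`) are hypothetical simplifications, not INCA's actual code, and the default of 20 is an arbitrary assumption.

```python
def _batcher(items, batchsize=10):
    """Stand-in batcher: yield successive lists of at most `batchsize` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batchsize:
            yield batch
            batch = []
    if batch:
        yield batch


def runwrap_broken(documents, action="run", *args, **kwargs):
    # Mirrors the quoted snippet: `bulksize` is neither a parameter nor a
    # local variable, so a keyword argument of that name only ends up in kwargs.
    if action == "batch":
        for batch in _batcher(documents, batchsize=bulksize):  # noqa: F821
            yield batch


def runwrap_patched(documents, action="run", *args, **kwargs):
    # Hypothetical fix: pull `bulksize` out of kwargs (assumed default: 20)
    # so it becomes a local name and is not forwarded to the processor.
    bulksize = kwargs.pop("bulksize", 20)
    if action == "batch":
        for batch in _batcher(documents, batchsize=bulksize):
            yield batch


docs = list(range(5))

# The function is a generator, so the NameError only surfaces on iteration.
try:
    list(runwrap_broken(docs, action="batch", bulksize=2))
except NameError as err:
    error_message = str(err)

batches = list(runwrap_patched(docs, action="batch", bulksize=2))
print(error_message)
print(batches)
```

Because the body is a generator, the error shows up when the `for doc in docs_regexp` loop starts rather than when the processor is called, which fits the log above. Whether the real fix belongs in the signature or in `kwargs` is a question for the maintainers, and the sketch says nothing about where `target_func` is defined.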
