marieai / marie-ai Goto Github PK

Integrate AI-powered Document Analysis Pipelines

License: MIT License

Python 95.13% Shell 0.68% Dockerfile 0.39% Makefile 0.02% Jupyter Notebook 0.07% HTML 0.11% C++ 1.00% Cuda 1.80% Go 0.75% C 0.04%

ocr icr optical-character-recognition intelligent-character-recognition intelligent-word-recognition iwr omr optical-mark-recognition docker document-layout-analysis

marie-ai's Introduction

Marie-AI

Integrate AI-powered document pipeline into your applications

Documentation

See the MarieAI docs.

Installation

You don't need this source code unless you want to modify the package. If you just want to use the package, just run:

pip install --upgrade marieai

Install from source with:

pip install -e .

Build docker container:

DOCKER_BUILDKIT=1 docker build . --build-arg PIP_TAG="standard" -f ./Dockerfiles/gpu.Dockerfile  -t marieai/marie:3.0-cuda

Command-line interface

This library additionally provides an marie command-line utility which makes it easy to interact with the API from your terminal. Run marie -h for usage.

Example code

Examples of how to use this library to accomplish various tasks can be found in the MarieAI documentation. It contains code examples for:

Document cleanup
Optical character recognition (OCR)
Document Classification
Document Splitter
Named Entity Recognition
Form detection
And more

Run with default entrypoint

docker run --rm  -it marieai/marie:3.0.19-cuda

Run the server with custom entrypoint

docker run --rm  -it --entrypoint /bin/bash  marieai/marie:3.0.19-cuda

Telemetry

https://telemetry.marieai.co/

TODO :MOVE TO DOCS

S3 Cloud Storage

docker compose -f  docker-compose.s3.yml --project-directory . up  --build --remove-orphans

CrossFTP

Configure AWS CLI Credentials.

vi ~/.aws/credentials
[marie] # this should be in the file
aws_access_key_id=your_access_key_id
aws_secret_access_key=your_secret_access_key

Pull the Docker image.

docker pull zenko/cloudserver

Create and start the container.

docker run --rm -it --name marie-s3-server -p 8000:8000 \
-e SCALITY_ACCESS_KEY_ID=MARIEACCESSKEY \
-e SCALITY_SECRET_ACCESS_KEY=MARIESECRETACCESSKEY \
-e S3DATA=multiple \
-e S3BACKEND=mem zenko/cloudserver

SCALITY_ACCESS_KEY_ID : Your AWS ACCESS KEY 
SCALITY_SECRET_ACCESS_KEY: Your AWS SECRET ACCESS KEY 
S3BACKEND: Currently using memory storage

Verify Installation.

aws s3 mb s3://mybucket  --profile marie --endpoint-url http://localhost:8000 --region us-west-2

aws s3 ls --profile marie --endpoint-url http://localhost:8000

aws s3 cp some_file.txt s3://mybucket  --profile marie --endpoint-url http://localhost:8000

aws s3 --profile marie --endpoint-url=http://127.0.0.1:8000 ls --recursive s3://

Production setup

Configuration for the S3 server will be stored in the following files: https://towardsdatascience.com/10-lessons-i-learned-training-generative-adversarial-networks-gans-for-a-year-c9071159628

marie-ai's People

Contributors

Stargazers

Watchers

Forkers

echaflo rithsek99

marie-ai's Issues

Add TensorFlow support

Core framework should support TensorFlow out off the start.

Excessive logging

There is excessive logging in the console and well as the file-system filling out the drive.

Implement TrOCR as alternative strategy

Implement following OCR strategy:

TrOCR: Transformer-based Optical Character Recognitionr

Handling multipage tiffs by page index

We need to be handling multipage tiffs by zero based page numbers. There are cases when we know that the data is on page N and we just want to process that page.

Here I like to add a new parameter pages to the request to handle processing multipage TIFFS/PDF by their page#

Examples
Current :
json_payload = {"data": base64_str, "mode": mode, "output": "json"}

New :
json_payload = {"data": base64_str, "mode": mode, "output": "json", "pages":"0,2,4,8"

Workarounds :

On client-side burst the document and process the page individually
Send whole document and process it as normal(wastes resources)

Add support for Intel Extension for PyTorch

Add Inte
https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f

Add Text rendering option with layout

Text rendering that allows us to keep layout for generate text document.

Assign specific GPU ID to OverlayProcessor

OverlayProcessor always get assigned default GPU ID 0. This needs to be configurable as when running on Multi-GPU this should use the correctly assigned one.

Source:

marie.overlay.overlay.OverlayProcessor 
@staticmethod 
def __setup(cuda: Any) -> Tuple[Any, BaseModel


 gpu_id = "0" if cuda else "-1"

Client failing when connecting via gRPC

Running client with gRPC protocol fails when executed against gRPC executor.

This can be replicated by executing 'test_gateway.py` integration test.

from marie import Client, DocumentArray

if __name__ == '__main__':
    c = Client(host='grpc://0.0.0.0:54321')
#    c = Client(host='http://0.0.0.0:54321')
    da = c.post('/aa', DocumentArray.empty(2))
    print(da.texts)

gRPC error: StatusCode.UNAVAILABLE failed to connect to all addresses 
       The ongoing request is terminated as the server is not available or closed already.                                 
Traceback (most recent call last):
  File "/home/gbugaj/dev/marie-ai/marie/clients/base/grpc.py", line 220, in _get_results
    async for resp in self._stream_rpc(
  File "/home/gbugaj/dev/marie-ai/marie/clients/base/grpc.py", line 78, in _stream_rpc
    async for resp in stub.Call(
  File "/home/gbugaj/environments/pytorch/lib/python3.8/site-packages/grpc/aio/_call.py", line 326, in _fetch_stream_responses
    await self._raise_for_status()
  File "/home/gbugaj/environments/pytorch/lib/python3.8/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status
    raise _create_rpc_error(await self.initial_metadata(), await
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1671656797.190254822","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671656797.190254501","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "client.py", line 6, in <module>
    da = c.post('/', DocumentArray.empty(2))
  File "/home/gbugaj/dev/marie-ai/marie/clients/mixin.py", line 275, in post
    return run_async(
  File "/home/gbugaj/dev/marie-ai/marie/helper.py", line 1345, in run_async
    return asyncio.run(func(*args, **kwargs))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/gbugaj/dev/marie-ai/marie/clients/mixin.py", line 266, in _get_results
    async for resp in c._get_results(*args, **kwargs):
  File "/home/gbugaj/dev/marie-ai/marie/clients/base/grpc.py", line 259, in _get_results
    raise ConnectionError(my_details)
ConnectionError: failed to connect to all addresses

Add multipage tiff processing

There is limited mulipage tiff support , need to expand that to all methods available.

Implement Beam search

Implement beam search for improved accuracy.

pix2pix integration issues

In find_model_using_name there is an issue with the dynamic imports.

    model_filename = "models." + model_name + "_model"
    modellib = importlib.import_module(model_filename)

Workaround is to have them map directly

    # FIXME : this needs to be fixes as the import_module is not working
    mapping = {
        'test': TestModel,
        'base': BaseModel
    }

    return mapping[model_name]

I suspect this is due to nested model folder names being imported.

Implement webhooks

There is a need for webhook integration into the ICR server to allow for non message-driven communication.

Client   ------>  Async Processing
Client   <------  ACK
Client   <------  Done(Webook)

Implementation should support the following security mechanism :

Verification Token
Mutual TLS (Transport Layer Security)

Reference :
Best Practice to Secure your WebHooks
Zoom Webhooks

Exporting and Inference on ONNX models

To improve model performance during CPU inference we can convert the models for ONNX and then use if onnxruntime is available during inference time.

Following script check_onnx_runtime.py can be used to test the performance of the models.

Inference time Results

2400x2400 on Resnet50 model
PyTorch 3.6160961884500464 VS ONNX 2.131322395749976

1200x1200 on Resnet50 model
PyTorch 0.8162189463499999 VS ONNX 0.35815778665000836

512x512 on Resnet50 model
PyTorch 0.12735954449999554 VS ONNX 0.08733407934996648

This is a good implementation that we can base our work on.

Add webhook integration

Need to add webhook integration.

Endpoints :

Add a webhook
Get all webhooks
Update webhook
Delete webhook
Get all logs for requests sent out for a specific webhook

Monitoring console

There is a need for a job monitoring console that allows us to see what is going on in the system.

Add PAGE-XML rendering option

Add PAGE-XML compatible rendering options.

Github PAGE-XML
PAGE Viewer

Folder content provider connector

Custom content connector that monitors a folder and submits the request to the service.

Refactor project into recommended project structure

Restructure project into recommended project structure.

It should follow structure defined here

ref : https://github.com/pandas-dev/pandas

Implement Stable-Diffusion for document cleanup

Implement Stable-Diffusion for document cleanup POC

Whitelist IP address for gateways to prevent DOS

Need to implement custom middleware for the gateways to allow connections from whitelisted IP only.

Formats:

172.16.10.0/24
172.16.10.128

Implement TextFuseNet text box detection

TextFuseNet: Scene Text Detection with Richer Fused Features](https://github.com/ying09/TextFuseNet)

Convert from Flask to FastAPI

All endpoints need to be converted from Flask to FastAPI

Implement work scheduling and distribution

Currently basic load balancing is used for work distributions, this needs to be replaces with much more robust solution.
In original jina-ai work distribution is done via gRpc, this does work however it has number of limitations, this is where our implementation will differ significantly.

Objectives :

Predictable workflow (Directed acyclic graph)
Scheduling and Prioritization
Resilience to failure
Reliability

Implement status endpoint

Status endpoint
The server can return the following HTTP status codes:

200 – successful method call
4xx – incorrect parameters of the method
5xx – an error on the server-side
The body of the response contains JSON payload

Status Description

Submitted - The task has been registered in the system but has not yet been passed for processing.
Queued - The task has been placed in the processing queue and is waiting to be processed.
Inprogress - The task is being processed.
Completed - The task has been processed successfully.
Failed - The task has not been processed because an error occurred.
Deleted - The task has been deleted.

Remove dependency on jina-hubble-sdk in docarray

Need to remove dependency on jina-hubble-sdk from docarray as this ties directly into Jina Cloud Service.

Primary offending class:

pushpull.py

Unable to upload to bucket 'marie'

When we start with new instance of S3 we need to ensure that he bucket gets first created.

INFO   extract_t/rep-2@60 [2023-03-30 22:06:17,772]  Render PDF [True]:        
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/pdf/results.pdf        
INFO   extract_t/rep-2@60 [2023-03-30 22:06:18,106]  Render PDF [False]:       
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/pdf/results_clean.pdf  
Rendering blob : /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/blobs
INFO   extract_t/rep-2@60 [2023-03-30 22:06:18,119]  Render BLOBS to :         
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/blobs                  
Rendering adlib : /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/adlib
AdlibRenderer base : {}
INFO   extract_t/rep-2@60 [2023-03-30 22:06:18,126]  Render Adlib to :         
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/adlib                  
ERROR  MARIE@60 [2023-03-30 22:06:18,231]  Unable to upload to bucket 'marie'  
       : An error occurred (NoSuchBucket) when calling the PutObject           
       operation: The specified bucket does not exist.

Implement tracking via opentelemetry

Implement OpenTelemetry meter that can provide instruments for collecting metrics from each processor.

Add metrics

Implement metrics and APDEX (Application Performance Index)

Unable to store document

When the connection is lost to the Postgres Database from the NER Router we are unable to store the documents.

Possibly a retry should be in order here or have a monitor task that checks if the system is up and running.

connection already closed
2022-08-15 11:23:07,763, ERROR    [ner_router.py:ner_router:__store:88] Unable to store document

Merge jina-hubble-sdk

Need to merge jina-hubble-sdk directly into marie-ai.

Service discovery started to soon

Service is registering with Service Discovery too soon and starts accepting requests. Which causes number of exceptions to be thrown and start accepting connections to process requests.

grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses |Gateway: Communication error with deployment ner_t at address(es) {'0.0.0.0:61970', '0.0.0.0:60456', '0.0.0.0:51350', '0.0.0.0:58728'}. Head or worker(s) may be down."
	debug_error_string = "{"created":"@1675092791.749754057","description":"Error received from peer ipv4:0.0.0.0:52000","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"failed to connect to all addresses |Gateway: Communication error with deployment ner_t at address(es) {'0.0.0.0:61970', '0.0.0.0:60456', '0.0.0.0:51350', '0.0.0.0:58728'}. Head or worker(s) may be down.","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.8/site-packages/marie_server/executors/ner/mserve_torch.py", line 32, in text_ner_post
    async for resp in c.post(
  File "/opt/venv/lib/python3.8/site-packages/marie/clients/mixin.py", line 358, in post
    async for result in c._get_results(
  File "/opt/venv/lib/python3.8/site-packages/marie/clients/base/grpc.py", line 271, in _get_results
    raise ConnectionError(my_details)
ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment ner_t at address(es) {'0.0.0.0:61970', '0.0.0.0:60456', '0.0.0.0:51350', '0.0.0.0:58728'}. Head or worker(s) may be down.

Add support for loading models from google drive

Add support for loading models from google drive in the ModelRegistry

Possible syntax.

    ModelRegistry.register_provider("gdrive")

    _name_or_path = "rms/layoutlmv3-large-corr-ner"
    kwargs = {
                      "provider":"gdrive",
                       "__model_path__": __model_path__
    }
    _name_or_path = ModelRegistry.get_local_path(_name_or_path, **kwargs)

Tokens exceeds maximum length 512

Getting this error while training because token length exceeds max token length 512 while using LayoutLMv3 in NERExecutor. Currently, we are trimming the document token/boxes/words to the length of 512 however we need to be able to label the whole document that might exceed that.

Idea is to partition document tokens into 512 chunks and process them individually, this does pose a problem when a token label such as 'address' could be now broken apart.

Ensure that data for transission is always safely encoded when including Numpy types

Data needs to be safely encoded for transmission where there is a Numpy objects present.

Example return data will throw an Error:

        np_arr = np.array([1, 2, 3])

        out = [
            {"sample": 112, "complex": ["a", "b"]},
            {"sample": 112, "complex": ["a", "b"], "np_arr": np_arr},
        ]

Exception :

marie.excepts.BadServer: request_id: "04899407ec50441bb90444987b14f303"
status {
  code: ERROR
  description: "ValueError(\'Unexpected type\')"
  exception {
    name: "ValueError"
    args: "Unexpected type"
    stacks: "Traceback (most recent call last):\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/serve/runtimes/worker/__init__.py\", line 265, in process_data\n    result = await self._request_handler.handle(\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/serve/runtimes/worker/request_handling.py\", line 438, in handle\n    _ = self._set_result(requests, return_data, docs)\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/serve/runtimes/worker/request_handling.py\", line 363, in _set_result\n    requests[0].parameters = params\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/types/request/data.py\", line 276, in parameters\n    self.proto_wo_data.parameters.update(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 820, in update\n    _SetStructValue(self.fields[key], value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 746, in _SetStructValue\n    struct_value.struct_value.update(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 820, in update\n    _SetStructValue(self.fields[key], value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 749, in _SetStructValue\n    struct_value.list_value.extend(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 838, in extend\n    self.append(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 834, in append\n    _SetStructValue(self.values.add(), value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 746, in _SetStructValue\n    struct_value.struct_value.update(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 820, in update\n    _SetStructValue(self.fields[key], value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 751, in _SetStructValue\n    raise ValueError(\'Unexpected type\')\n"
    stacks: "ValueError: Unexpected type\n"
    executor: "ExtractExecutor"
  }
}

Proposed is @safely_encoded decorator that will get our data ready for transmission by converting Object to JSON and back to Object.

    @safely_encoded
    @requests(on="/status")
    def status(self, **kwargs):
        np_arr = np.array([1, 2, 3])
        out = [
            {"sample": 112, "complex": ["a", "b"]},
            {"sample": 112, "complex": ["a", "b"], "np_arr": np_arr},
        ]
        return out

Content retrieval plugin system

In order to support different storage providers there is a need for a plugin system .

Cuda Error: Out of memory

This needs to be handled better.

 File "/opt/marie-icr/marie/executor/ner/ner_extraction_executor.py", line 701, in preprocess
    ocr_results, frames = obtain_ocr(src_image, self.text_executor)
  File "/opt/marie-icr/marie/executor/ner/ner_extraction_executor.py", line 78, in obtain_ocr
    results = text_executor.extract(docs, **kwa)
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 369, in extract
    logger.error("Extract error", error)
Message: 'Extract error'
Arguments: (RuntimeError('CUDA out of memory. Tried to allocate 362.00 MiB (GPU 0; 47.54 GiB total capacity; 38.29 GiB already allocated; 358.94 MiB free; 44.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'),)
--- Logging error ---
Traceback (most recent call last):
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 346, in extract
    results = self.__process_extract_fullpage(
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 164, in __process_extract_fullpage
    result, overlay_image = self.icr_processor.recognize(
  File "/opt/marie-icr/marie/document/icr_processor.py", line 250, in recognize
    raise ex
  File "/opt/marie-icr/marie/document/icr_processor.py", line 119, in recognize
    results = self.recognize_from_fragments(fragments)
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 251, in recognize_from_fragments
    raise ex
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 232, in recognize_from_fragments
    predictions, scores = get_text(
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 122, in get_text
    results = task.inference_step(

2022-08-15 09:34:20,276 DEBG 'wsgi-app' stdout output:
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/tasks/fairseq_task.py", line 542, in inference_step
    return generator.generate(
  File "/opt/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/sequence_generator.py", line 204, in generate
    return self._generate(sample, **kwargs)
  File "/opt/marie-icr/marie/models/unilm/trocr/generator.py", line 144, in _generate
    lprobs, avg_attn_scores = self.model.forward_decoder(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/sequence_generator.py", line 819, in forward_decoder
    decoder_out = model.decoder.forward(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 217, in forward
    x, extra = self.extract_features(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 239, in extract_features
    return self.extract_features_scriptable(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 340, in extract_features_scriptable
    x, layer_attn, _ = layer(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/modules/transformer_layer.py", line 487, in forward
    x, attn = self.encoder_attn(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/modules/multihead_attention.py", line 593, in forward
    k = self.k_proj(key)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA out of memory. Tried to allocate 362.00 MiB (GPU 0; 47.54 GiB total capacity; 38.29 GiB already allocated; 356.94 MiB free; 44.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Duplicate service registry checks

When service is registered with Consul and then restarted there will be a UUID assigned to the service which causes duplicate service checks.

traefik-system-ingress@5100#275615fa-86ce-4014-bbe0-ee0078862d47

Prevent downloading assets multiple times

There are assets that are being downloaded multiple times as per this log. This should be downloaded only once per run.

ro_sharding='none') roberta2 2022-04-26 13:18:13 | INFO | models.unilm.trocr.task | Load gpt2 dictionary from https://layoutlm.blob.core.windows.net/trocr/dictionaries/gpt2_with_mask.dict.txt 2022-04-26 13:18:15 | INFO | models.unilm.trocr.task | [label] load dictionary: 50265 types 2022-04-26 13:18:15 | INFO | fairseq.file_utils | https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json not found in cache, downloading to /tmp/tmpgnpzp46x 1042301B [00:00, 2283543.33B/s] 2022-04-26 13:18:16 | INFO | fairseq.file_utils | copying /tmp/tmpgnpzp46x to cache at /home/app-svc/.cache/torch/pytorch_fairseq/e2aab4d600e7568c2d88fc7732130ccc815ea84ec63906cb0913c7a3a4906a2e.0f323dfaed92d080380e63f0291d0f31adfa8c61a62cbcb3cb8114f061be27f7 2022-04-26 13:18:16 | INFO | fairseq.file_utils | creating metadata file for /home/app-svc/.cache/torch/pytorch_fairseq/e2aab4d600e7568c2d88fc7732130ccc815ea84ec63906cb0913c7a3a4906a2e.0f323dfaed92d080380e63f0291d0f31adfa8c61a62cbcb3cb8114f061be27f7 2022-04-26 13:18:16 | INFO | fairseq.file_utils | removing temp file /tmp/tmpgnpzp46x 2022-04-26 13:18:16 | INFO | fairseq.file_utils | https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe not found in cache, downloading to /tmp/tmpb9t7i28q 456318B [00:00, 1740809.43B/s] 2022-04-26 13:18:17 | INFO | fairseq.file_utils | copying /tmp/tmpb9t7i28q to cache at /home/app-svc/.cache/torch/pytorch_fairseq/b04a6d337c09f464fe8f0df1d3524db88a597007d63f05d97e437f65840cdba5.939bed25cbdab15712bac084ee713d6c78e221c5156c68cb0076b03f5170600f 2022-04-26 13:18:17 | INFO | fairseq.file_utils | creating metadata file for /home/app-svc/.cache/torch/pytorch_fairseq/b04a6d337c09f464fe8f0df1d3524db88a597007d63f05d97e437f65840cdba5.939bed25cbdab15712bac084ee713d6c78e221c5156c68cb0076b03f5170600f 2022-04-26 13:18:17 | INFO | fairseq.file_utils | removing temp file /tmp/tmpb9t7i28q 2022-04-26 13:18:20 | INFO | models.unilm.trocr.deit_models | Using the learned pos embedding version loading roberta. 2022-04-26 13:18:20 | INFO | models.unilm.trocr.deit_models | Load pre-trained decoder parameters from roberta.large Downloading: "https://github.com/pytorch/fairseq/archive/main.zip" to /home/app-svc/.cache/torch/hub/main.zip 2022-04-26 13:18:24 | INFO | fairseq.file_utils | http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz not found in cache, downloading to /tmp/tmpi8_krgek 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 655283069/655283069 [01:01<00:00, 10669018.59B/s] 2022-04-26 13:19:25 | INFO | fairseq.file_utils | copying /tmp/tmpi8_krgek to cache at /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2 2022-04-26 13:19:25 | INFO | fairseq.file_utils | creating metadata file for /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2 2022-04-26 13:19:25 | INFO | fairseq.file_utils | removing temp file /tmp/tmpi8_krgek 2022-04-26 13:19:26 | INFO | fairseq.file_utils | loading archive file http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz from cache at /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2 2022-04-26 13:19:26 | INFO | fairseq.file_utils | extracting archive file /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2 to temp dir /tmp/tmpq4hr_lrz

Integrate service discovery

Port existing service discovery that have been implemented via register.py into the new Gateway / Flow framework.

Add support for vertical named entity aggregration

Based on the discussion here How to cluster words into semantic entities, when performing information extraction? #92 need to add support for vertical named entity aggregation.

Add Page Segmentation Mode

Add support for page segmentation mode. this will be similar to functionality found in Tesseract

Page segmentation modes

sparse Find as much text as possible (default).
word Treat the image as a single word.
line Treat the image as a single text line.

Add ABBYY-XML compatible rendering options

Add ABBYY-XML compatible rendering options.

ABBYY FineReader Engine & XML Export

Implement health check registry

Implement health check registry.

Registry needs to support following types:

http
script
tcp
sql

Example configuration:

monitor.json

{
  "shell" :"/bin/bash",
  "args": [
    "./monitor/memory.sh"
  ],
  "id": "pan.memory.0",
  "name": "pan.memory",
  "interval": "PT1M",
  "ttl": "PT1M",
  "timeout": "PT10S",
  "tags": [
    "agent",
    "system",
    "pan",
    "memory"
  ],

    "webhooks": [
      {
        "name": "",
        "uri": "",
        "payload": "",
      }
    ],

  "__type__": "script"
}

Webhooks: If any health check returns a Failure result, this collections will be used to notify the error status. (Payload is the json payload and must be escaped.)

memory.sh

#!/bin/bash

process="SAMPLE"

# percentage
failureval=95
critivalval=90

total_memory=$(free | grep Mem: | awk '{print $2}')
used_memory=$(free | grep buffers/cache: | awk '{print $3}')
memory_use=`echo $used_memory $total_memory | awk '{print $1 / $2 * 100}'`
memory_use=${memory_use%%.*}

printf "Checking $process memory on $(hostname -i)... "

if [ "$memory_use" -ge "$failureval" ]; then
    echo "FAILED. $memory_use%."
    exit 2
elif [ "$memory_use" -ge "$critivalval" ]; then
    echo "CRITICAL. $memory_use%."
    exit 1
elif [ "$memory_use" -le "$critivalval" ]; then
    echo "PASSED. $memory_use%."
    exit 0
fi

echo "Memory use could not be determined."
exit 2

Shell scripts will have following error codes:
0 = Passed
1 = Critical
2 = Failed

Example:

if [ "$memory_use" -ge "$failureval" ]; then
    echo "FAILED. $memory_use%."
    exit 2
elif [ "$memory_use" -ge "$critivalval" ]; then
    echo "CRITICAL. $memory_use%."
    exit 1
elif [ "$memory_use" -le "$critivalval" ]; then
    echo "PASSED. $memory_use%."
    exit 0
fi

DiT for Text Detection

Implement DiT for Text Detection
This should include text box, line detection and all page segmentation models.

Example Usage

    box = BoxProcessorUlimDit(
        work_dir=work_dir_boxes,
        models_dir="./model_zoo/unilm/dit/text_detection",
        cuda=True,
    )

    (
        boxes,
        fragments,
        lines,
        _,
        lines_bboxes,
    ) = box.extract_bounding_boxes(key, "field", image, PSMode.SPARSE)

Invalid type conversion

When sending request to the backend via following method:

    async def __process(client: Client, input_docs, parameters):
        payload = {}
        async for resp in client.post(
            '/text/extract',
            input_docs,
            request_size=-1,
            parameters=parameters,
            return_responses=True,
        ):
            payload = parse_response_to_payload(resp)
        return payload

There is a type conversion happening that performs conversion from int to float this is problematic when expecting an index to access arrays or for IDs in general.

Payload will be modified in following way:

   {'engine': 'BEST', 'regions': [{'y': 828.0, 'h': 36.0, 'x':        
       1661.0, 'pageIndex': 1.0, 'id': 9359800610.0, 'w': 551.0}, {'x':        
       1614.0, 'w': 601.0, 'y': 691.0, 'id': 9359800604.0, 'pageIndex': 1.0,   
       'h': 33.0}], 'pipeline': {'preprocessors': []}, 'srcBase64':

Here the 'pageIndex': 1.0 have been converted to a float

Exception:

       ╭───────────────── Traceback (most recent call last) ─────────────────╮ 
       │ /home/gbugaj/dev/marieai/marie-ai/marie/ocr/default_ocr_engine.py:… │ 
       │ in extract                                                          │ 
       │                                                                     │ 
       │   266 │   │   │   │   │   ro_frames, queue_id, checksum, pms_mode,  │ 
       │   267 │   │   │   │   )                                             │ 
       │   268 │   │   │   else:                                             │ 
       │ ❱ 269 │   │   │   │   results = self.__process_extract_regions(     │ 
       │   270 │   │   │   │   │   ro_frames, queue_id, checksum, pms_mode,  │ 
       │   271 │   │   │   │   )                                             │ 
       │   272                                                               │ 
       │                                                                     │ 
       │ /home/gbugaj/dev/marieai/marie-ai/marie/ocr/default_ocr_engine.py:… │ 
       │ in __process_extract_regions                                        │ 
       │                                                                     │ 
       │   203 │   │   │   │   │   │   output.append(region_result)          │ 
       │   204 │   │   │   except Exception as ex:                           │ 
       │   205 │   │   │   │   self.logger.error(ex)                         │ 
       │ ❱ 206 │   │   │   │   raise ex                                      │ 
       │   207 │   │                                                         │ 
       │   208 │   │   # Filter out base 64 encoded fragments(fragment_b64,  │ 
       │   209 │   │   # This is useful when we like to display or process i │ 
       │       significant payload overhead                                  │ 
       │                                                                     │ 
       │ /home/gbugaj/dev/marieai/marie-ai/marie/ocr/default_ocr_engine.py:… │ 
       │ in __process_extract_regions                                        │ 
       │                                                                     │ 
       │   154 │   │   │   │   w = region["w"]                               │ 
       │   155 │   │   │   │   h = region["h"]                               │ 
       │   156 │   │   │   │                                                 │ 
       │ ❱ 157 │   │   │   │   img = frames[page_index]                      │ 
       │   158 │   │   │   │   img = img[y : y + h, x : x + w].copy()        │ 
       │   159 │   │   │   │   overlay = img                                 │ 
       │   160                                                               │ 
       ╰─────────────────────────────────────────────────────────────────────╯

Implement message queue service

In order to communicate in WAN environments we need to utilize some kind of message queue service.

Possible providers :

Amazon Simple Queue Service (SQS)
Amazon Message Queue
Google pubsub
RabbitMQ Shovel

Document Partitioning Scheme

Support to add document partitioning scheme where large page documents can per partitioned and then processed on multiple nodes.

Convert project to monorepo

Monorepo project will provide

Shared code and visibility
Atomic changes

Add JAX support

Core framework should support JAX framework out off the start.