
marie-ai's Introduction


Marie-AI

Integrate AI-powered document pipeline into your applications

Documentation

See the MarieAI docs.

Installation

You don't need this source code unless you want to modify the package. If you just want to use the package, run:

pip install --upgrade marieai

Install from source with:

pip install -e .

Build docker container:

DOCKER_BUILDKIT=1 docker build . --build-arg PIP_TAG="standard" -f ./Dockerfiles/gpu.Dockerfile  -t marieai/marie:3.0-cuda 

Command-line interface

This library also provides a marie command-line utility that makes it easy to interact with the API from your terminal. Run marie -h for usage.

Example code

Examples of how to use this library to accomplish various tasks can be found in the MarieAI documentation. It contains code examples for:

  • Document cleanup
  • Optical character recognition (OCR)
  • Document Classification
  • Document Splitter
  • Named Entity Recognition
  • Form detection
  • And more

Run with default entrypoint

docker run --rm  -it marieai/marie:3.0.19-cuda

Run the server with custom entrypoint

docker run --rm  -it --entrypoint /bin/bash  marieai/marie:3.0.19-cuda  

Telemetry

https://telemetry.marieai.co/

TODO: Move to docs

S3 Cloud Storage

docker compose -f  docker-compose.s3.yml --project-directory . up  --build --remove-orphans

CrossFTP

Configure AWS CLI Credentials.

vi ~/.aws/credentials
[marie] # this should be in the file
aws_access_key_id=your_access_key_id
aws_secret_access_key=your_secret_access_key

Pull the Docker image.

docker pull zenko/cloudserver

Create and start the container.

docker run --rm -it --name marie-s3-server -p 8000:8000 \
-e SCALITY_ACCESS_KEY_ID=MARIEACCESSKEY \
-e SCALITY_SECRET_ACCESS_KEY=MARIESECRETACCESSKEY \
-e S3DATA=multiple \
-e S3BACKEND=mem zenko/cloudserver
  • SCALITY_ACCESS_KEY_ID: your AWS access key
  • SCALITY_SECRET_ACCESS_KEY: your AWS secret access key
  • S3BACKEND: currently using in-memory storage

Verify Installation.

aws s3 mb s3://mybucket  --profile marie --endpoint-url http://localhost:8000 --region us-west-2
aws s3 ls --profile marie --endpoint-url http://localhost:8000
aws s3 cp some_file.txt s3://mybucket  --profile marie --endpoint-url http://localhost:8000
aws s3 --profile marie --endpoint-url=http://127.0.0.1:8000 ls --recursive s3://
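
The same checks can be scripted from Python with boto3. A minimal sketch, assuming the marie profile and the local endpoint configured above:

    # Sketch: mirror the AWS CLI checks above with boto3 against the local
    # Zenko endpoint. Profile, endpoint and bucket name come from this section.
    import boto3

    session = boto3.Session(profile_name="marie")
    s3 = session.client("s3", endpoint_url="http://localhost:8000", region_name="us-west-2")

    s3.create_bucket(
        Bucket="mybucket",
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )
    s3.upload_file("some_file.txt", "mybucket", "some_file.txt")
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])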

Production setup

Configuration for the S3 server will be stored in the server's configuration files.

marie-ai's People

Contributors

fuseraft, gbugaj, gregbugaj, rithsek99, rsteele5


Forkers

echaflo rithsek99

marie-ai's Issues

Excessive logging

There is excessive logging to the console as well as to the file system, which is filling up the drive.

Handling multipage tiffs by page index

We need to handle multipage TIFFs by zero-based page number. There are cases where we know the data is on page N and we only want to process that page.

I would like to add a new parameter, pages, to the request to handle processing multipage TIFFs/PDFs by page number.

Examples
Current :
json_payload = {"data": base64_str, "mode": mode, "output": "json"}

New :
json_payload = {"data": base64_str, "mode": mode, "output": "json", "pages":"0,2,4,8"

Workarounds :

  • On the client side, burst the document and process each page individually
  • Send the whole document and process it as normal (wastes resources)
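
A minimal sketch of how the server side could interpret the proposed pages parameter; the zero-based semantics come from this issue, the helper name is hypothetical:

    # Hypothetical sketch: select zero-based pages from a multipage TIFF
    # according to the proposed "pages" request parameter.
    from PIL import Image, ImageSequence

    def select_pages(tiff_path: str, pages: str):
        wanted = {int(p) for p in pages.split(",") if p.strip()}
        with Image.open(tiff_path) as img:
            return [
                frame.copy()
                for i, frame in enumerate(ImageSequence.Iterator(img))
                if i in wanted
            ]

    frames = select_pages("document.tif", "0,2,4,8")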

Assign specific GPU ID to OverlayProcessor

OverlayProcessor always gets assigned the default GPU ID 0. This needs to be configurable so that, when running on a multi-GPU machine, it uses the correctly assigned device.

Source:

    # marie.overlay.overlay.OverlayProcessor
    @staticmethod
    def __setup(cuda: Any) -> Tuple[Any, BaseModel]:
        ...
        gpu_id = "0" if cuda else "-1"

Client failing when connecting via gRPC

Running the client with the gRPC protocol fails when executed against a gRPC executor.

This can be replicated by executing the test_gateway.py integration test.

from marie import Client, DocumentArray

if __name__ == '__main__':
    c = Client(host='grpc://0.0.0.0:54321')
#    c = Client(host='http://0.0.0.0:54321')
    da = c.post('/aa', DocumentArray.empty(2))
    print(da.texts)

gRPC error: StatusCode.UNAVAILABLE failed to connect to all addresses 
       The ongoing request is terminated as the server is not available or closed already.                                 
Traceback (most recent call last):
  File "/home/gbugaj/dev/marie-ai/marie/clients/base/grpc.py", line 220, in _get_results
    async for resp in self._stream_rpc(
  File "/home/gbugaj/dev/marie-ai/marie/clients/base/grpc.py", line 78, in _stream_rpc
    async for resp in stub.Call(
  File "/home/gbugaj/environments/pytorch/lib/python3.8/site-packages/grpc/aio/_call.py", line 326, in _fetch_stream_responses
    await self._raise_for_status()
  File "/home/gbugaj/environments/pytorch/lib/python3.8/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status
    raise _create_rpc_error(await self.initial_metadata(), await
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1671656797.190254822","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1671656797.190254501","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "client.py", line 6, in <module>
    da = c.post('/', DocumentArray.empty(2))
  File "/home/gbugaj/dev/marie-ai/marie/clients/mixin.py", line 275, in post
    return run_async(
  File "/home/gbugaj/dev/marie-ai/marie/helper.py", line 1345, in run_async
    return asyncio.run(func(*args, **kwargs))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/gbugaj/dev/marie-ai/marie/clients/mixin.py", line 266, in _get_results
    async for resp in c._get_results(*args, **kwargs):
  File "/home/gbugaj/dev/marie-ai/marie/clients/base/grpc.py", line 259, in _get_results
    raise ConnectionError(my_details)
ConnectionError: failed to connect to all addresses

pix2pix integration issues

In find_model_using_name there is an issue with the dynamic imports.

    model_filename = "models." + model_name + "_model"
    modellib = importlib.import_module(model_filename)

The workaround is to map them directly:

    # FIXME: this needs to be fixed as import_module is not working
    mapping = {
        'test': TestModel,
        'base': BaseModel
    }

    return mapping[model_name]

I suspect this is due to nested model folder names being imported.
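
A possible replacement for the hard-coded mapping, assuming the model modules live under a fully qualified package such as marie.models and each module exposes a class named like TestModel; the package path is an assumption:

    # Hypothetical sketch of find_model_using_name using a fully qualified
    # package path so nested folders resolve correctly.
    import importlib

    def find_model_using_name(model_name: str, package: str = "marie.models"):
        module = importlib.import_module(f"{package}.{model_name}_model")
        return getattr(module, model_name.capitalize() + "Model")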

Implement webhooks

There is a need for webhook integration in the ICR server to allow for non-message-driven communication.

Client   ------>  Async Processing
Client   <------  ACK
Client   <------  Done (Webhook)

The implementation should support the following security mechanisms:

  • Verification Token
  • Mutual TLS (Transport Layer Security)

Reference :
Best Practice to Secure your WebHooks
Zoom Webhooks
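
A minimal sketch of verification-token checking via an HMAC signature header; the header name and secret handling are assumptions:

    # Hypothetical sketch: verify a webhook delivery by comparing an HMAC
    # signature computed over the raw body against a signature header.
    import hashlib
    import hmac

    def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
        expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature_header)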

Exporting and Inference on ONNX models

To improve performance during CPU inference we can convert the models to ONNX and then use onnxruntime, if available, at inference time.

The following script, check_onnx_runtime.py, can be used to test the performance of the models.

Inference time results (ResNet-50 model)

  • 2400x2400: PyTorch 3.6160961884500464 vs ONNX 2.131322395749976
  • 1200x1200: PyTorch 0.8162189463499999 vs ONNX 0.35815778665000836
  • 512x512: PyTorch 0.12735954449999554 vs ONNX 0.08733407934996648

This is a good implementation that we can base our work on.
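
A minimal sketch of the export/inference round trip that check_onnx_runtime.py measures; the input size, model source and file name are assumptions (torchvision >= 0.13 API):

    # Sketch: export a torchvision ResNet-50 to ONNX and run it with
    # onnxruntime when installed, falling back to PyTorch otherwise.
    import numpy as np
    import torch
    import torchvision

    model = torchvision.models.resnet50(weights=None).eval()
    dummy = torch.randn(1, 3, 512, 512)
    torch.onnx.export(model, dummy, "resnet50.onnx", input_names=["input"], output_names=["output"])

    try:
        import onnxruntime as ort

        session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
        out = session.run(None, {"input": dummy.numpy().astype(np.float32)})[0]
    except ImportError:
        with torch.no_grad():
            out = model(dummy).numpy()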

Add webhook integration

Need to add webhook integration.

Endpoints :

  • Add a webhook
  • Get all webhooks
  • Update webhook
  • Delete webhook
  • Get all logs for requests sent out for a specific webhook

Monitoring console

There is a need for a job monitoring console that allows us to see what is going on in the system.

Implement work scheduling and distribution

Currently, basic load balancing is used for work distribution; this needs to be replaced with a much more robust solution.
In the original jina-ai, work distribution is done via gRPC. This works, however it has a number of limitations, and this is where our implementation will differ significantly (a minimal DAG-ordering sketch follows the objectives list below).

Objectives :

  • Predictable workflow (Directed acyclic graph)
  • Scheduling and Prioritization
  • Resilience to failure
  • Reliability
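
A minimal sketch of the predictable-workflow objective above, ordering work items expressed as a DAG with a topological sort; the task names are illustrative only:

    # Sketch: deterministic execution order for a DAG of work items
    # (requires Python 3.9+ for graphlib). Task names are illustrative.
    from graphlib import TopologicalSorter

    dag = {
        "ocr": set(),
        "classify": {"ocr"},
        "ner": {"ocr"},
        "index": {"classify", "ner"},
    }
    order = list(TopologicalSorter(dag).static_order())
    print(order)  # e.g. ['ocr', 'classify', 'ner', 'index']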

Implement status endpoint

Status endpoint
The server can return the following HTTP status codes:

  • 200 – successful method call
  • 4xx – incorrect parameters of the method
  • 5xx – an error on the server side

The body of the response contains a JSON payload.

Status Description

  • Submitted - The task has been registered in the system but has not yet been passed for processing.
  • Queued - The task has been placed in the processing queue and is waiting to be processed.
  • Inprogress - The task is being processed.
  • Completed - The task has been processed successfully.
  • Failed - The task has not been processed because an error occurred.
  • Deleted - The task has been deleted.
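
A possible response body for a status call; the field names are assumptions, only the status values come from the list above:

    {
      "status": "Completed",
      "taskId": "example-task-id",
      "submittedAt": "2023-03-30T22:06:17Z",
      "completedAt": "2023-03-30T22:06:18Z"
    }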

Unable to upload to bucket 'marie'

When we start with a new instance of S3, we need to ensure that the bucket gets created first.

INFO   extract_t/rep-2@60 [2023-03-30 22:06:17,772]  Render PDF [True]:        
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/pdf/results.pdf        
INFO   extract_t/rep-2@60 [2023-03-30 22:06:18,106]  Render PDF [False]:       
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/pdf/results_clean.pdf  
Rendering blob : /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/blobs
INFO   extract_t/rep-2@60 [2023-03-30 22:06:18,119]  Render BLOBS to :         
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/blobs                  
Rendering adlib : /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/adlib
AdlibRenderer base : {}
INFO   extract_t/rep-2@60 [2023-03-30 22:06:18,126]  Render Adlib to :         
       /tmp/generators/a802ca0aef145b40bbfbf3d8096b681c/adlib                  
ERROR  MARIE@60 [2023-03-30 22:06:18,231]  Unable to upload to bucket 'marie'  
       : An error occurred (NoSuchBucket) when calling the PutObject           
       operation: The specified bucket does not exist.         
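
A minimal sketch of ensuring the bucket exists before the first upload; the bucket name comes from this issue, the helper itself is an assumption:

    # Sketch: create the "marie" bucket on startup if it does not exist yet.
    import boto3
    from botocore.exceptions import ClientError

    def ensure_bucket(s3_client, bucket: str = "marie") -> None:
        try:
            s3_client.head_bucket(Bucket=bucket)
        except ClientError:
            s3_client.create_bucket(Bucket=bucket)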

Add metrics

Implement metrics and APDEX (Application Performance Index)
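
For reference, Apdex is computed from response-time buckets; a minimal sketch with an assumed target threshold T:

    # Sketch: Apdex = (satisfied + tolerating / 2) / total samples, where
    # satisfied is <= T and tolerating is <= 4T. The threshold T is assumed.
    def apdex(response_times, t: float = 0.5) -> float:
        satisfied = sum(1 for r in response_times if r <= t)
        tolerating = sum(1 for r in response_times if t < r <= 4 * t)
        return (satisfied + tolerating / 2) / len(response_times)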

Unable to store document

When the connection to the Postgres database is lost from the NER router, we are unable to store documents.

A retry may be in order here, or a monitor task that checks whether the database is up and running.

connection already closed
2022-08-15 11:23:07,763, ERROR    [ner_router.py:ner_router:__store:88] Unable to store document
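
A minimal retry sketch for the store call; the retry count and backoff are assumptions:

    # Sketch: retry the store operation with a simple backoff when the
    # Postgres connection has dropped. Retry count and delay are assumptions.
    import time

    def store_with_retry(store_fn, document, retries: int = 3, delay_s: float = 2.0):
        for attempt in range(retries):
            try:
                return store_fn(document)
            except Exception:
                if attempt == retries - 1:
                    raise
                time.sleep(delay_s * (attempt + 1))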

Service discovery started too soon

The service registers with service discovery too soon and starts accepting requests, which causes a number of exceptions to be thrown while it is not yet ready to process them.

grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses |Gateway: Communication error with deployment ner_t at address(es) {'0.0.0.0:61970', '0.0.0.0:60456', '0.0.0.0:51350', '0.0.0.0:58728'}. Head or worker(s) may be down."
	debug_error_string = "{"created":"@1675092791.749754057","description":"Error received from peer ipv4:0.0.0.0:52000","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"failed to connect to all addresses |Gateway: Communication error with deployment ner_t at address(es) {'0.0.0.0:61970', '0.0.0.0:60456', '0.0.0.0:51350', '0.0.0.0:58728'}. Head or worker(s) may be down.","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.8/site-packages/marie_server/executors/ner/mserve_torch.py", line 32, in text_ner_post
    async for resp in c.post(
  File "/opt/venv/lib/python3.8/site-packages/marie/clients/mixin.py", line 358, in post
    async for result in c._get_results(
  File "/opt/venv/lib/python3.8/site-packages/marie/clients/base/grpc.py", line 271, in _get_results
    raise ConnectionError(my_details)
ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment ner_t at address(es) {'0.0.0.0:61970', '0.0.0.0:60456', '0.0.0.0:51350', '0.0.0.0:58728'}. Head or worker(s) may be down.


Add support for loading models from Google Drive

Add support for loading models from Google Drive in the ModelRegistry.

Possible syntax.

    ModelRegistry.register_provider("gdrive")

    _name_or_path = "rms/layoutlmv3-large-corr-ner"
    kwargs = {
        "provider": "gdrive",
        "__model_path__": __model_path__,
    }
    _name_or_path = ModelRegistry.get_local_path(_name_or_path, **kwargs)

Tokens exceed maximum length 512

Getting this error while training because the token length exceeds the maximum of 512 when using LayoutLMv3 in NERExecutor. Currently we trim the document tokens/boxes/words to a length of 512, however we need to be able to label whole documents that exceed that limit.

The idea is to partition the document tokens into chunks of 512 and process them individually; this does pose a problem when a token label such as 'address' can now be broken apart.
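
A minimal sketch of the chunking idea, splitting tokens into windows of at most 512 with an overlap so labels that straddle a boundary can be reconciled later; the stride is an assumption:

    # Sketch: split a long token sequence into overlapping windows so each
    # chunk fits the 512-token limit. The stride (overlap) is an assumption.
    def chunk_tokens(tokens, boxes, max_len: int = 512, stride: int = 128):
        chunks = []
        step = max_len - stride
        for start in range(0, len(tokens), step):
            end = start + max_len
            chunks.append((tokens[start:end], boxes[start:end]))
            if end >= len(tokens):
                break
        return chunks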

Ensure that data for transmission is always safely encoded when including NumPy types

Data needs to be safely encoded for transmission when NumPy objects are present.

Example return data that will throw an error:

        np_arr = np.array([1, 2, 3])

        out = [
            {"sample": 112, "complex": ["a", "b"]},
            {"sample": 112, "complex": ["a", "b"], "np_arr": np_arr},
        ]

Exception :

marie.excepts.BadServer: request_id: "04899407ec50441bb90444987b14f303"
status {
  code: ERROR
  description: "ValueError(\'Unexpected type\')"
  exception {
    name: "ValueError"
    args: "Unexpected type"
    stacks: "Traceback (most recent call last):\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/serve/runtimes/worker/__init__.py\", line 265, in process_data\n    result = await self._request_handler.handle(\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/serve/runtimes/worker/request_handling.py\", line 438, in handle\n    _ = self._set_result(requests, return_data, docs)\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/serve/runtimes/worker/request_handling.py\", line 363, in _set_result\n    requests[0].parameters = params\n"
    stacks: "  File \"/dev/marieai/marie-ai/marie/types/request/data.py\", line 276, in parameters\n    self.proto_wo_data.parameters.update(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 820, in update\n    _SetStructValue(self.fields[key], value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 746, in _SetStructValue\n    struct_value.struct_value.update(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 820, in update\n    _SetStructValue(self.fields[key], value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 749, in _SetStructValue\n    struct_value.list_value.extend(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 838, in extend\n    self.append(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 834, in append\n    _SetStructValue(self.values.add(), value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 746, in _SetStructValue\n    struct_value.struct_value.update(value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 820, in update\n    _SetStructValue(self.fields[key], value)\n"
    stacks: "  File \"/dev/marieai/marie-as-service/venv/lib/python3.8/site-packages/google/protobuf/internal/well_known_types.py\", line 751, in _SetStructValue\n    raise ValueError(\'Unexpected type\')\n"
    stacks: "ValueError: Unexpected type\n"
    executor: "ExtractExecutor"
  }
}

The proposed @safely_encoded decorator gets the data ready for transmission by converting the object to JSON and back.

    @safely_encoded
    @requests(on="/status")
    def status(self, **kwargs):
        np_arr = np.array([1, 2, 3])
        out = [
            {"sample": 112, "complex": ["a", "b"]},
            {"sample": 112, "complex": ["a", "b"], "np_arr": np_arr},
        ]
        return out
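
One possible implementation of the decorator, round-tripping the return value through JSON with a NumPy-aware encoder; this is a sketch, not the final API:

    # Sketch of @safely_encoded: convert the return value to plain JSON types
    # (lists, ints, floats) so the protobuf Struct conversion no longer fails.
    import functools
    import json

    import numpy as np

    class NumpyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.ndarray):
                return obj.tolist()
            if isinstance(obj, np.generic):
                return obj.item()
            return super().default(obj)

    def safely_encoded(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return json.loads(json.dumps(func(*args, **kwargs), cls=NumpyEncoder))
        return wrapper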

CUDA Error: Out of memory

This needs to be handled better.

 File "/opt/marie-icr/marie/executor/ner/ner_extraction_executor.py", line 701, in preprocess
    ocr_results, frames = obtain_ocr(src_image, self.text_executor)
  File "/opt/marie-icr/marie/executor/ner/ner_extraction_executor.py", line 78, in obtain_ocr
    results = text_executor.extract(docs, **kwa)
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 369, in extract
    logger.error("Extract error", error)
Message: 'Extract error'
Arguments: (RuntimeError('CUDA out of memory. Tried to allocate 362.00 MiB (GPU 0; 47.54 GiB total capacity; 38.29 GiB already allocated; 358.94 MiB free; 44.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'),)
--- Logging error ---
Traceback (most recent call last):
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 346, in extract
    results = self.__process_extract_fullpage(
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 164, in __process_extract_fullpage
    result, overlay_image = self.icr_processor.recognize(
  File "/opt/marie-icr/marie/document/icr_processor.py", line 250, in recognize
    raise ex
  File "/opt/marie-icr/marie/document/icr_processor.py", line 119, in recognize
    results = self.recognize_from_fragments(fragments)
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 251, in recognize_from_fragments
    raise ex
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 232, in recognize_from_fragments
    predictions, scores = get_text(
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 122, in get_text
    results = task.inference_step(

2022-08-15 09:34:20,276 DEBG 'wsgi-app' stdout output:
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/tasks/fairseq_task.py", line 542, in inference_step
    return generator.generate(
  File "/opt/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/sequence_generator.py", line 204, in generate
    return self._generate(sample, **kwargs)
  File "/opt/marie-icr/marie/models/unilm/trocr/generator.py", line 144, in _generate
    lprobs, avg_attn_scores = self.model.forward_decoder(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/sequence_generator.py", line 819, in forward_decoder
    decoder_out = model.decoder.forward(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 217, in forward
    x, extra = self.extract_features(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 239, in extract_features
    return self.extract_features_scriptable(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 340, in extract_features_scriptable
    x, layer_attn, _ = layer(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/modules/transformer_layer.py", line 487, in forward
    x, attn = self.encoder_attn(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/modules/multihead_attention.py", line 593, in forward
    k = self.k_proj(key)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA out of memory. Tried to allocate 362.00 MiB (GPU 0; 47.54 GiB total capacity; 38.29 GiB already allocated; 356.94 MiB free; 44.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


Duplicate service registry checks

When the service is registered with Consul and then restarted, a new UUID is assigned to the service, which causes duplicate service checks.

traefik-system-ingress@5100#275615fa-86ce-4014-bbe0-ee0078862d47

Prevent downloading assets multiple times

There are assets that are downloaded multiple times, as shown in this log. Each asset should be downloaded only once per run.

ro_sharding='none') roberta2
2022-04-26 13:18:13 | INFO | models.unilm.trocr.task | Load gpt2 dictionary from https://layoutlm.blob.core.windows.net/trocr/dictionaries/gpt2_with_mask.dict.txt
2022-04-26 13:18:15 | INFO | models.unilm.trocr.task | [label] load dictionary: 50265 types
2022-04-26 13:18:15 | INFO | fairseq.file_utils | https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json not found in cache, downloading to /tmp/tmpgnpzp46x
1042301B [00:00, 2283543.33B/s]
2022-04-26 13:18:16 | INFO | fairseq.file_utils | copying /tmp/tmpgnpzp46x to cache at /home/app-svc/.cache/torch/pytorch_fairseq/e2aab4d600e7568c2d88fc7732130ccc815ea84ec63906cb0913c7a3a4906a2e.0f323dfaed92d080380e63f0291d0f31adfa8c61a62cbcb3cb8114f061be27f7
2022-04-26 13:18:16 | INFO | fairseq.file_utils | creating metadata file for /home/app-svc/.cache/torch/pytorch_fairseq/e2aab4d600e7568c2d88fc7732130ccc815ea84ec63906cb0913c7a3a4906a2e.0f323dfaed92d080380e63f0291d0f31adfa8c61a62cbcb3cb8114f061be27f7
2022-04-26 13:18:16 | INFO | fairseq.file_utils | removing temp file /tmp/tmpgnpzp46x
2022-04-26 13:18:16 | INFO | fairseq.file_utils | https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe not found in cache, downloading to /tmp/tmpb9t7i28q
456318B [00:00, 1740809.43B/s]
2022-04-26 13:18:17 | INFO | fairseq.file_utils | copying /tmp/tmpb9t7i28q to cache at /home/app-svc/.cache/torch/pytorch_fairseq/b04a6d337c09f464fe8f0df1d3524db88a597007d63f05d97e437f65840cdba5.939bed25cbdab15712bac084ee713d6c78e221c5156c68cb0076b03f5170600f
2022-04-26 13:18:17 | INFO | fairseq.file_utils | creating metadata file for /home/app-svc/.cache/torch/pytorch_fairseq/b04a6d337c09f464fe8f0df1d3524db88a597007d63f05d97e437f65840cdba5.939bed25cbdab15712bac084ee713d6c78e221c5156c68cb0076b03f5170600f
2022-04-26 13:18:17 | INFO | fairseq.file_utils | removing temp file /tmp/tmpb9t7i28q
2022-04-26 13:18:20 | INFO | models.unilm.trocr.deit_models | Using the learned pos embedding version loading roberta.
2022-04-26 13:18:20 | INFO | models.unilm.trocr.deit_models | Load pre-trained decoder parameters from roberta.large
Downloading: "https://github.com/pytorch/fairseq/archive/main.zip" to /home/app-svc/.cache/torch/hub/main.zip
2022-04-26 13:18:24 | INFO | fairseq.file_utils | http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz not found in cache, downloading to /tmp/tmpi8_krgek
100%|██████████| 655283069/655283069 [01:01<00:00, 10669018.59B/s]
2022-04-26 13:19:25 | INFO | fairseq.file_utils | copying /tmp/tmpi8_krgek to cache at /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2
2022-04-26 13:19:25 | INFO | fairseq.file_utils | creating metadata file for /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2
2022-04-26 13:19:25 | INFO | fairseq.file_utils | removing temp file /tmp/tmpi8_krgek
2022-04-26 13:19:26 | INFO | fairseq.file_utils | loading archive file http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz from cache at /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2
2022-04-26 13:19:26 | INFO | fairseq.file_utils | extracting archive file /home/app-svc/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2 to temp dir /tmp/tmpq4hr_lrz

Integrate service discovery

Port the existing service discovery, which has been implemented via register.py, into the new Gateway / Flow framework.

Add Page Segmentation Mode

Add support for page segmentation modes. This will be similar to the functionality found in Tesseract.

Page segmentation modes

  • sparse – Find as much text as possible (default).
  • word – Treat the image as a single word.
  • line – Treat the image as a single text line.
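
A possible shape for the mode enum, mirroring the list above and the PSMode already referenced elsewhere in this document; the member values are assumptions:

    # Sketch: page segmentation modes mirroring the list above.
    from enum import Enum

    class PSMode(Enum):
        SPARSE = "sparse"  # find as much text as possible (default)
        WORD = "word"      # treat the image as a single word
        LINE = "line"      # treat the image as a single text line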

Implement health check registry

Implement health check registry.

The registry needs to support the following types:

  • http
  • script
  • tcp
  • sql

Example configuration:

monitor.json

{
  "shell" :"/bin/bash",
  "args": [
    "./monitor/memory.sh"
  ],
  "id": "pan.memory.0",
  "name": "pan.memory",
  "interval": "PT1M",
  "ttl": "PT1M",
  "timeout": "PT10S",
  "tags": [
    "agent",
    "system",
    "pan",
    "memory"
  ],

    "webhooks": [
      {
        "name": "",
        "uri": "",
        "payload": "",
      }
    ],

  "__type__": "script"
}

Webhooks: If any health check returns a failure result, this collection will be used to notify the error status. (Payload is the JSON payload and must be escaped.)

memory.sh

#!/bin/bash

process="SAMPLE"

# percentage
failureval=95
criticalval=90

total_memory=$(free | grep Mem: | awk '{print $2}')
used_memory=$(free | grep buffers/cache: | awk '{print $3}')
memory_use=`echo $used_memory $total_memory | awk '{print $1 / $2 * 100}'`
memory_use=${memory_use%%.*}

printf "Checking $process memory on $(hostname -i)... "

if [ "$memory_use" -ge "$failureval" ]; then
    echo "FAILED. $memory_use%."
    exit 2
elif [ "$memory_use" -ge "$critivalval" ]; then
    echo "CRITICAL. $memory_use%."
    exit 1
elif [ "$memory_use" -le "$critivalval" ]; then
    echo "PASSED. $memory_use%."
    exit 0
fi

echo "Memory use could not be determined."
exit 2

Shell scripts will use the following exit codes:
0 = Passed
1 = Critical
2 = Failed


DiT for Text Detection

Implement DiT for Text Detection
This should include text box detection, line detection, and all page segmentation models.

Example Usage

    box = BoxProcessorUlimDit(
        work_dir=work_dir_boxes,
        models_dir="./model_zoo/unilm/dit/text_detection",
        cuda=True,
    )

    (
        boxes,
        fragments,
        lines,
        _,
        lines_bboxes,
    ) = box.extract_bounding_boxes(key, "field", image, PSMode.SPARSE)

Invalid type conversion

When sending a request to the backend via the following method:

    async def __process(client: Client, input_docs, parameters):
        payload = {}
        async for resp in client.post(
            '/text/extract',
            input_docs,
            request_size=-1,
            parameters=parameters,
            return_responses=True,
        ):
            payload = parse_response_to_payload(resp)
        return payload

There is a type conversion happening from int to float; this is problematic when the value is expected to be an index for array access, or for IDs in general.

The payload will be modified in the following way:

   {'engine': 'BEST', 'regions': [{'y': 828.0, 'h': 36.0, 'x':        
       1661.0, 'pageIndex': 1.0, 'id': 9359800610.0, 'w': 551.0}, {'x':        
       1614.0, 'w': 601.0, 'y': 691.0, 'id': 9359800604.0, 'pageIndex': 1.0,   
       'h': 33.0}], 'pipeline': {'preprocessors': []}, 'srcBase64':     

Here 'pageIndex': 1.0 has been converted to a float.

Exception:

       ╭───────────────── Traceback (most recent call last) ─────────────────╮ 
       │ /home/gbugaj/dev/marieai/marie-ai/marie/ocr/default_ocr_engine.py:… │ 
       │ in extract                                                          │ 
       │                                                                     │ 
       │   266 │   │   │   │   │   ro_frames, queue_id, checksum, pms_mode,  │ 
       │   267 │   │   │   │   )                                             │ 
       │   268 │   │   │   else:                                             │ 
       │ ❱ 269 │   │   │   │   results = self.__process_extract_regions(     │ 
       │   270 │   │   │   │   │   ro_frames, queue_id, checksum, pms_mode,  │ 
       │   271 │   │   │   │   )                                             │ 
       │   272                                                               │ 
       │                                                                     │ 
       │ /home/gbugaj/dev/marieai/marie-ai/marie/ocr/default_ocr_engine.py:… │ 
       │ in __process_extract_regions                                        │ 
       │                                                                     │ 
       │   203 │   │   │   │   │   │   output.append(region_result)          │ 
       │   204 │   │   │   except Exception as ex:                           │ 
       │   205 │   │   │   │   self.logger.error(ex)                         │ 
       │ ❱ 206 │   │   │   │   raise ex                                      │ 
       │   207 │   │                                                         │ 
       │   208 │   │   # Filter out base 64 encoded fragments(fragment_b64,  │ 
       │   209 │   │   # This is useful when we like to display or process i │ 
       │       significant payload overhead                                  │ 
       │                                                                     │ 
       │ /home/gbugaj/dev/marieai/marie-ai/marie/ocr/default_ocr_engine.py:… │ 
       │ in __process_extract_regions                                        │ 
       │                                                                     │ 
       │   154 │   │   │   │   w = region["w"]                               │ 
       │   155 │   │   │   │   h = region["h"]                               │ 
       │   156 │   │   │   │                                                 │ 
       │ ❱ 157 │   │   │   │   img = frames[page_index]                      │ 
       │   158 │   │   │   │   img = img[y : y + h, x : x + w].copy()        │ 
       │   159 │   │   │   │   overlay = img                                 │ 
       │   160                                                               │ 
       ╰─────────────────────────────────────────────────────────────────────╯ 
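
A defensive sketch for the consuming side: coerce region fields that must be integers (page index, coordinates, ids) back to int before they are used as array indices; the field names come from the payload above:

    # Sketch: coerce numeric region fields back to int after the protobuf
    # Struct round-trip has turned them into floats.
    INT_FIELDS = ("pageIndex", "id", "x", "y", "w", "h")

    def normalize_region(region: dict) -> dict:
        return {
            key: int(value) if key in INT_FIELDS else value
            for key, value in region.items()
        }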

Implement message queue service

In order to communicate in WAN environments we need to utilize some kind of message queue service.

Possible providers :

  • Amazon Simple Queue Service (SQS)
  • Amazon MQ
  • Google Pub/Sub
  • RabbitMQ Shovel

Document Partitioning Scheme

Add support for a document partitioning scheme where documents with a large number of pages can be partitioned and then processed on multiple nodes.

Add JAX support

The core framework should support the JAX framework from the start.
