Comments (19)

pseudotensor commented on August 19, 2024

For nomic, it uses an unreleased sentence-transformers 2.4.0.dev version. An attempt to pip install that dev version leads to failures in their API, so nothing is usable.

Once sentence-transformers 2.4.0 is released without bugs, we can upgrade h2oGPT and pass the required trust_remote_code option; that option does not exist in prior sentence-transformers releases.
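For reference, once that release exists, usage should look roughly like this (a sketch; nomic-ai/nomic-embed-text-v1 is used as an example model id):

    from sentence_transformers import SentenceTransformer

    # trust_remote_code is only accepted by sentence-transformers >= 2.4.0
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
    embeddings = model.encode(["search_document: hello world"])  # nomic models expect task prefixes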

pseudotensor commented on August 19, 2024

4096 is on the high end; yes, it can be made smaller as required. On CPU I expect it to work reasonably well, but the issue is that bge-m3 has an 8k context, so it uses a lot more memory than its size suggests when chunks are large.

I think the issue is that for summarization purposes we double the chunks, and there's no limit to their size, so that might be hitting the bge-m3 model hard since it'll take the full 8k.

One will have to tell the model to truncate at (say) smaller token counts or (yes) limit the batch size.
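With sentence-transformers that would look roughly like this (a sketch; the specific numbers are only illustrative):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")
    model.max_seq_length = 1024  # truncate inputs well below the native 8k context
    embeddings = model.encode(chunks, batch_size=2)  # chunks: your list of chunk texts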

pseudotensor commented on August 19, 2024

Probably bge-m3 is better. Did you try that one? It also has long context and is much smaller than other LLM-based models.

pseudotensor commented on August 19, 2024

The error H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2' sounds like DocTR is not installed correctly.

slavag commented on August 19, 2024

@pseudotensor no, I didn't try bge-m3; I will. Is the usage just to specify BAAI/bge-m3, or do I need to install anything?
As for DocTR, I checked everything according to their GitHub and don't see anything missing, yet I still get: H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
Thanks

slavag commented on August 19, 2024

@pseudotensor wanted to try bge, but it failed (regardless of the embeddings model, though). A few months ago this was working fine, when I did ingestion with another embeddings model (same dataset):

python src/make_db.py --hf_embedding_model=BAAI/bge-m3 --chunk_size=8192 --user_path=/Users/slava/Documents/Development/private/ZendDeskTicketsNew -collection_name=ZenDeskTicketsWithDocsBGE
 59%|███████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                                         | 27/46 [00:07<00:05,  3.41it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 60%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                                       | 28/47 [00:08<00:05,  3.30it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 65%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                              | 35/54 [00:08<00:04,  4.09it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 66%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                            | 37/56 [00:08<00:04,  4.30it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 67%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                          | 39/58 [00:08<00:04,  4.53it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 69%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                      | 43/62 [00:08<00:03,  4.97it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 70%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                     | 44/63 [00:08<00:03,  5.05it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                 | 49/68 [00:08<00:03,  5.58it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                | 51/70 [00:08<00:03,  5.81it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 65%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                            | 3106/4776 [00:09<00:05, 329.55it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 63%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                               | 3107/4904 [00:09<00:05, 329.02it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 62%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                                 | 3108/5032 [00:09<00:05, 328.77it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 60%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                                    | 3109/5160 [00:09<00:06, 328.64it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
 63%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                              | 4007/6312 [00:09<00:05, 418.69it/s]H2OOCRLoader: unknown architecture 'crnn_efficientnetv2_mV2'
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41067/41067 [00:15<00:00, 2675.56it/s]
Exceptions: 0/92227 []
[1]    41576 killed     python3 src/make_db.py --hf_embedding_model=BAAI/bge-m3 --chunk_size=8192  
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown                      
  warnings.warn('resource_tracker: There appear to be %d '
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 1 leaked folder objects to clean up at shutdown
  warnings.warn(
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /var/folders/z1/qsct20p17nxdfhjlp29yr6r40000gn/T/joblib_memmapping_folder_41576_f980b37c109a4082a6f2c1759202b4c9_31ee9ea798ce4943b8c16dcd8596d055: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")

No other info is available; the process just terminated in the middle.

slavag commented on August 19, 2024

@pseudotensor Hi, can you please advise on the issue I mentioned above with make_db?
Thanks a lot !!!

pseudotensor commented on August 19, 2024

For DocTR you can't use their repo; it has to be installed from our fork as described in readme_linux.md or its linux_install.sh.

The missing model suggests the original DocTR repo is being used.

slavag commented on August 19, 2024

@pseudotensor Thanks,
What about make_db crashing in the first minutes of execution, without much info?

Exceptions: 0/92227 []
[1]    41576 killed     python3 src/make_db.py --hf_embedding_model=BAAI/bge-m3 --chunk_size=8192  
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown                      
  warnings.warn('resource_tracker: There appear to be %d '
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 1 leaked folder objects to clean up at shutdown
  warnings.warn(
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /var/folders/z1/qsct20p17nxdfhjlp29yr6r40000gn/T/joblib_memmapping_folder_41576_f980b37c109a4082a6f2c1759202b4c9_31ee9ea798ce4943b8c16dcd8596d055: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")

Thanks.

pseudotensor commented on August 19, 2024

Looks like a system OOM. You can check sudo dmesg -T to see if the OOM killer hit.

slavag commented on August 19, 2024

@pseudotensor indeed OOM, but I don't know why this started to happen: the memory size of the process reached 86GB, while my Mac has 32GB plus swap. In the past I was able to create the db from those files (I tried the default embeddings now).

Please advise.
Thanks

pseudotensor commented on August 19, 2024

Maybe one of the parsers went nuts, e.g. tesseract may have a bug. On gpt.h2o.ai I had one case where memory peaked at 512GB.

Are you able to see from the verbose logging which document might have been an issue?

slavag commented on August 19, 2024

@pseudotensor No, I don't see it, and now I have only text files (I removed the PDFs) and still get the same issue.

slavag commented on August 19, 2024

@pseudotensor it seems that this issue happens after the parsing:

.....
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket54810.txt
Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket72169.txt
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket72169.txt
Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket119236.txt
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket119236.txt
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 42124/42128 [00:26<00:00, 1642.83it/s]Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket73277.txt
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket73277.txt
Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket35222.txt
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket35222.txt
Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket87490.txt
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket87490.txt
Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket63064.txt
DONE Ingesting file: /Users/slava/Documents/Development/private/ZendDeskTickets/ticket63064.txt
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 42128/42128 [00:26<00:00, 1576.36it/s]
0it [00:00, ?it/s]
END consuming path_or_paths=/Users/slava/Documents/Development/private/ZendDeskTickets url=None text=None
Exceptions: 0/498289 []
Loading and updating db
Found 498289 new sources (0 have no hash in original source, so have to reprocess for migration to sources with hash)
Removing 0 duplicate files from db because ingesting those as new documents
Existing db, adding to db_dir_ZenDeskTicketsWithDocsBGE
[1]    91113 killed     python3 src/make_db.py  -collection_name=ZenDeskTicketsWithDocsBGE  --n_jobs=
/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown                      
  warnings.warn('resource_tracker: There appear to be %d '

slavag commented on August 19, 2024

Tried to run make_db with a memory profiler (memray).
First, I don't have PDFs, but it looks like the code somehow still uses DocTR and others, and then I see a huge allocation in transformers xlm_roberta.
Have a look at the screenshot:
[screenshot]

Summary of allocations

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Location                                                                                                           ┃        <Total Memory> ┃        Total Memory % ┃            Own Memory ┃          Own Memory % ┃      Allocation Count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ _PyEval_Vector at <unknown>                                                                                        │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677454 │
│ _run_tracker at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/memray/commands/run.py            │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677474 │
│ run_path at <frozen runpy>                                                                                         │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677470 │
│ PyObject_Vectorcall at <unknown>                                                                                   │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677457 │
│ cfunction_vectorcall_FASTCALL_KEYWORDS at <unknown>                                                                │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677442 │
│ PyEval_EvalCode at <unknown>                                                                                       │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677434 │
│ builtin_exec at <unknown>                                                                                          │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677434 │
│ _run_code at <frozen runpy>                                                                                        │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677425 │
│ _run_module_code at <frozen runpy>                                                                                 │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677425 │
│ <module> at src/make_db.py                                                                                         │             170.146GB │               100.00% │                0.000B │                 0.00% │               1677423 │
│ _PyObject_MakeTpCall at <unknown>                                                                                  │             168.601GB │                99.09% │                0.000B │                 0.00% │               1288653 │
│ H2O_Fire at /Users/slava/Documents/Development/private/AI/h2ogpt/src/utils.py                                      │             137.791GB │                80.98% │                0.000B │                 0.00% │               1450699 │
│ Fire at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py                              │             137.791GB │                80.98% │                0.000B │                 0.00% │               1450694 │
│ _Fire at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py                             │             137.791GB │                80.98% │                0.000B │                 0.00% │               1450689 │
│ _CallAndUpdateTrace at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/fire/core.py               │             137.791GB │                80.98% │                0.000B │                 0.00% │               1450688 │
│ make_db_main at src/make_db.py                                                                                     │             137.791GB │                80.98% │                0.000B │                 0.00% │               1450681 │
│ _PyVectorcall_Call at <unknown>                                                                                    │             137.671GB │                80.91% │                0.000B │                 0.00% │               1402343 │
│ method_vectorcall at <unknown>                                                                                     │             136.676GB │                80.33% │                0.000B │                 0.00% │               1254230 │
│ create_or_update_db at /Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py                   │             136.557GB │                80.26% │                0.000B │                 0.00% │               1211409 │
│ get_db at /Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py                                │             136.557GB │                80.26% │                0.000B │                 0.00% │               1211407 │
│ _PyObject_FastCallDictTstate at <unknown>                                                                          │             136.520GB │                80.24% │                0.000B │                 0.00% │               1219361 │
│ _PyObject_Call_Prepend at <unknown>                                                                                │             136.499GB │                80.22% │                0.000B │                 0.00% │               1199368 │
│ _PyObject_Call at <unknown>                                                                                        │             136.447GB │                80.19% │                0.000B │                 0.00% │               1174211 │
│ cfunction_call at <unknown>                                                                                        │             136.258GB │                80.08% │                0.000B │                 0.00% │                 77360 │
│ c10::DefaultCPUAllocator::allocate(unsigned long) const at <unknown>                                               │             136.205GB │                80.05% │                0.000B │                 0.00% │                   508 │
│ at::TensorBase at::detail::_empty_generic<long long>(c10::ArrayRef<long long>, c10::Allocator*,                    │             136.205GB │                80.05% │                0.000B │                 0.00% │                  1531 │
│ c10::DispatchKeySet, c10::ScalarType, std::__1::optional<c10::MemoryFormat>) at <unknown>                          │                       │                       │                       │                       │                       │
│ c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl>>            │             136.205GB │                80.05% │                0.000B │                 0.00% │                  1014 │
│ c10::intrusive_ptr<c10::StorageImpl,                                                                               │                       │                       │                       │                       │                       │
│ c10::detail::intrusive_target_default_null_type<c10::StorageImpl>>::make<c10::StorageImpl::use_byte_size_t,        │                       │                       │                       │                       │                       │
│ unsigned long&, c10::Allocator*&, bool>(c10::StorageImpl::use_byte_size_t&&, unsigned long&, c10::Allocator*&,     │                       │                       │                       │                       │                       │
│ bool&&) at <unknown>                                                                                               │                       │                       │                       │                       │                       │
│ c10::StorageImpl::StorageImpl(c10::StorageImpl::use_byte_size_t, c10::SymInt const&, c10::Allocator*, bool) at     │             136.205GB │                80.05% │                0.000B │                 0.00% │                   506 │
│ <unknown>                                                                                                          │                       │                       │                       │                       │                       │
│ add_to_db at /Users/slava/Documents/Development/private/AI/h2ogpt/src/gpt_langchain.py                             │             134.109GB │                78.82% │                0.000B │                 0.00% │                 37446 │
│ add_documents at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain_core/vectorstores.py   │             134.104GB │                78.82% │                0.000B │                 0.00% │                 37402 │
│ add_texts at                                                                                                       │             134.102GB │                78.82% │                0.000B │                 0.00% │                 37399 │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain_community/vectorstores/chroma.py        │                       │                       │                       │                       │                       │
│ embed_documents at                                                                                                 │             134.093GB │                78.81% │                0.000B │                 0.00% │                 37552 │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain_community/embeddings/huggingface.py     │                       │                       │                       │                       │                       │
│ slot_tp_call at <unknown>                                                                                          │             134.010GB │                78.76% │                0.000B │                 0.00% │                  2414 │
│ encode at                                                                                                          │             134.007GB │                78.76% │                0.000B │                 0.00% │                   231 │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py      │                       │                       │                       │                       │                       │
│ _wrapped_call_impl at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/torch/nn/modules/module.py  │             134.001GB │                78.76% │                0.000B │                 0.00% │                   125 │
│ forward at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/torch/nn/modules/container.py          │             134.001GB │                78.76% │                0.000B │                 0.00% │                   125 │
│ _call_impl at /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/torch/nn/modules/module.py          │             134.001GB │                78.76% │                0.000B │                 0.00% │                   124 │
│ forward at                                                                                                         │             134.001GB │                78.76% │                0.000B │                 0.00% │                   112 │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/sentence_transformers/models/Transformer.py       │                       │                       │                       │                       │                       │
│ forward at                                                                                                         │             134.001GB │                78.76% │                0.000B │                 0.00% │                   111 │
│ /Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_rob… │                       │                       │                       │                       │                       │
│ torch::autograd::THPVariable_matmul(_object*, _object*, _object*) at <unknown>                                     │             130.000GB │                76.40% │                0.000B │                 0.00% │                    22 │
│ at::native::_matmul_impl(at::Tensor&, at::Tensor const&, at::Tensor const&) at <unknown>                           │             130.000GB │                76.40% │                0.000B │                 0.00% │                    20 │
│ at::native::matmul(at::Tensor const&, at::Tensor const&) at <unknown>                                              │             130.000GB │                76.40% │                0.000B │                 0.00% │                    20 │
│ at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) at <unknown>                                          │             130.000GB │                76.40% │                0.000B │                 0.00% │                    20 │
│ c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPoint… │             128.000GB │                75.23% │                0.000B │                 0.00% │                    12 │
│ (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CPU_bmm(at::Tensor const&, at::Tensor  │                       │                       │                       │                       │                       │
│ const&)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&>>, at::Tensor (at::Tensor │                       │                       │                       │                       │                       │
│ const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) │                       │                       │                       │                       │                       │
│ at <unknown>                                                                                                       │                       │                       │                       │                       │                       │
│ c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPoint… │             128.000GB │                75.23% │                0.000B │                 0.00% │                    12 │
│ (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous            │                       │                       │                       │                       │                       │
│ namespace)::bmm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>, at::Tensor,                           │                       │                       │                       │                       │                       │
│ c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&>>, at::Tensor              │                       │                       │                       │                       │                       │
│ (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet,      │                       │                       │                       │                       │                       │
│ at::Tensor const&, at::Tensor const&) at <unknown>                                                                 │                       │                       │                       │                       │                       │
│ at::_ops::bmm::call(at::Tensor const&, at::Tensor const&) at <unknown>                                             │             128.000GB │                75.23% │                0.000B │                 0.00% │                    12 │
│ at::(anonymous namespace)::structured_bmm_out_cpu_functional::set_output_raw_strided(long long, c10::ArrayRef<long │             128.000GB │                75.23% │                0.000B │                 0.00% │                     4 │
│ long>, c10::ArrayRef<long long>, c10::TensorOptions, c10::ArrayRef<at::Dimname>) at <unknown>                      │                       │                       │                       │                       │                       │
│ void at::meta::common_checks_baddbmm_bmm<at::meta::structured_bmm>(at::meta::structured_bmm&, at::Tensor const&,   │             128.000GB │                75.23% │                0.000B │                 0.00% │                     4 │
│ at::Tensor const&, c10::Scalar const&, c10::Scalar const&, bool, std::__1::optional<at::Tensor> const&) at         │                       │                       │                       │                       │                       │
│ <unknown>                                                                                                          │                       │                       │                       │                       │                       │
│ at::meta::structured_bmm::meta(at::Tensor const&, at::Tensor const&) at <unknown>                                  │             128.000GB │                75.23% │                0.000B │                 0.00% │                     4 │
│ _find_and_load at <frozen importlib._bootstrap>                                                                    │              32.409GB │                19.05% │                0.000B │                 0.00% │                241978 │
│ _find_and_load_unlocked at <frozen importlib._bootstrap>                                                           │              32.409GB │                19.05% │                0.000B │                 0.00% │                241976 │
│ _load_unlocked at <frozen importlib._bootstrap>                                                                    │              32.409GB │                19.05% │                0.000B │                 0.00% │                241770 │
│ exec_module at <frozen importlib._bootstrap_external>                                                              │              32.409GB │                19.05% │                0.000B │                 0.00% │                241768 │
│ object_vacall at <unknown>                                                                                         │              32.406GB │                19.05% │                0.000B │                 0.00% │                239971 │
│ PyObject_CallMethodObjArgs at <unknown>                                                                            │              32.406GB │                19.05% │                0.000B │                 0.00% │                239971 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┘
🥇 Top 5 largest allocating locations (by size):
	- forward:/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py:237 -> 130.002GB
	- init_lib:/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/pypdfium2/_library_scope.py:25 -> 32.000GB
	- hash_file:/Users/slava/Documents/Development/private/AI/h2ogpt/src/utils.py:1124 -> 5.313GB
	- filter:/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/fnmatch.py:56 -> 4.468GB
	- load_file:/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/safetensors/torch.py:308 -> 4.231GB
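For scale, the 130GB from the xlm_roberta matmul is roughly what fp32 attention scores cost at the full 8k context. Scores have shape [batch, heads, seq, seq]; assuming an XLM-RoBERTa-large-like config with 16 heads:

    # back-of-the-envelope check (assumptions: fp32 scores, 16 heads, 8192-token sequences)
    heads, seq_len, fp32_bytes = 16, 8192, 4
    per_seq = heads * seq_len * seq_len * fp32_bytes  # attention-score bytes per sequence, per layer
    print(per_seq / 2**30)       # 4.0 GiB per sequence
    print(32 * per_seq / 2**30)  # 128.0 GiB -> a batch of ~32 full-length chunks matches the bmm lines above

So even a modest batch of full-length chunks would explain the blow-up.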

Can you please try to create any large db with the default embedding model and check if it's working on your end?
Thanks

pseudotensor commented on August 19, 2024

Nice tool. It looks as if the Chroma team changed something. Maybe the batch size is larger now and they send an arbitrarily large batch to the embedding model.

Can you add some prints to this code?

h2ogpt/src/gpt_langchain.py

Lines 318 to 346 in 190310d

num_new_sources = len(sources)
if num_new_sources == 0:
    return db, num_new_sources, []
if hasattr(db, '_persist_directory'):
    print("Existing db, adding to %s" % db._persist_directory, flush=True)
    # chroma only
    lock_file = get_db_lock_file(db)
    context = filelock.FileLock
else:
    lock_file = None
    context = NullContext
with context(lock_file):
    # this is place where add to db, but others maybe accessing db, so lock access.
    # else see RuntimeError: Index seems to be corrupted or unsupported
    import chromadb
    api = chromadb.PersistentClient(path=db._persist_directory)
    if hasattr(api, 'max_batch_size'):
        max_batch_size = api.max_batch_size
    elif hasattr(api, '_producer') and hasattr(api._producer, 'max_batch_size'):
        max_batch_size = api._producer.max_batch_size
    else:
        max_batch_size = int(os.getenv('CHROMA_MAX_BATCH_SIZE', '100'))
    sources_batches = split_list(sources, max_batch_size)
    for sources_batch in sources_batches:
        db.add_documents(documents=sources_batch)
        db.persist()
    clear_embedding(db)
    # save here is for migration, in case old db directory without embedding saved
    save_embed(db, use_openai_embedding, hf_embedding_model)

Specifically, the max_batch_size can be printed. Maybe it's crazy large.
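For example, a one-line addition right after max_batch_size is determined would do:

    print("chroma max_batch_size: %s" % max_batch_size, flush=True)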

slavag commented on August 19, 2024

@pseudotensor Max batch size is 83333, coming from the max_batch_size attr.

pseudotensor commented on August 19, 2024

Try the latest changes. 83333 is very large; I made the max 4096. Or you can control it via the env var CHROMA_MAX_BATCH_SIZE.
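For example (the value is only illustrative):

    CHROMA_MAX_BATCH_SIZE=1024 python src/make_db.py --hf_embedding_model=BAAI/bge-m3 --collection_name=ZenDeskTicketsWithDocsBGE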

slavag commented on August 19, 2024

@pseudotensor much better, thanks.
Btw, with bge-m3 on a MacBook Pro 32GB M1 Max, batch size > 4 fails: not enough memory.
On Linux with an NVIDIA A10G (24GB), 4096 also failed; it works with 1 or 2.

Also, maybe it's a good idea to add a device option to make_db, as on Mac it can use Metal.
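For instance, something like this (a sketch; assumes a PyTorch build with MPS support):

    import torch
    from sentence_transformers import SentenceTransformer

    # use the Apple-Silicon GPU (Metal) when available, else fall back to CPU
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = SentenceTransformer("BAAI/bge-m3", device=device)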
