errors with org files containing certain types of structures, about khoj-ai/khoj

Comments (21)

emacsomancer commented on May 16, 2024 1

Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

One, a file with *-headings, but only *-headers (nothing "underneath" any of the headings):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 181, in setup
    corpus_embeddings = compute_embeddings(entries, bi_encoder, config.embeddings_file, regenerate=regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 66, in compute_embeddings
    corpus_embeddings = bi_encoder.encode([entry['compiled'] for entry in entries], convert_to_tensor=True, device=state.device, show_progress_bar=True)
  File "/home/slade/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
    all_embeddings = torch.stack(all_embeddings)
RuntimeError: stack expects a non-empty TensorList

Two, a file with no *-headings at all (but which is still a valid .org file):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()

I can provide specific examples of such files if necessary.

yantar92 commented on May 16, 2024 1

The error is gone on my side.

debanjum commented on May 16, 2024

It doesn't seem to be an issue with input-filter. It looks more like an issue with the OrgNode parser parsing (some of?) your org file(s). The level argument is expected to be a string of *s seen at the start of a heading, not an int.
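To illustrate the point about `level`, here is a minimal sketch (not khoj's or OrgNode's actual code) of a splitter that always keeps the level as a string of `*`s. Text before the first heading gets an empty-string level, so `len(level)` is 0 rather than a `TypeError`. The names `split_entries` and `HEADING` are made up for this example.

```python
import re

# A heading is one or more '*' at the start of a line, followed by a space.
HEADING = re.compile(r"^(\*+) (.*)$")


def split_entries(text):
    """Split org text into (level, heading, body) tuples.

    `level` is always a str of '*' characters. Text that appears before
    any heading is returned with level "" (assumption: we want to keep it,
    not drop it), so len(level) is always valid.
    """
    entries = []
    level, heading, body = "", "", []

    def flush():
        # Only emit something if there is a heading or non-blank body text.
        if heading or any(line.strip() for line in body):
            entries.append((level, heading, "\n".join(body).strip()))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level, heading, body = match.group(1), match.group(2), []
        else:
            body.append(line)
    flush()
    return entries
```

With this shape, a file with no headings yields a single entry whose level is `""`, and a file with only headings yields entries with empty bodies, so both cases from the tracebacks above are representable without special-casing an int.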

Are you seeing this issue even when you only set the input-files field in the khoj.yml?

Next Steps:

- Let me try to reproduce the issue on my end to see what's causing it
- If you can share a test file or snippet that could be causing the failure, it'll speed up the fix for this issue
- Until then, you can try to bypass the issue by identifying and excluding the file(s) that may be causing the parsing error

emacsomancer commented on May 16, 2024

But what about numbered lists and -, + type headings?

(Additional: on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?)

yantar92 commented on May 16, 2024

Benjamin Slade ***@***.***> writes:

>     self.level = len(level)
> TypeError: object of type 'int' has no len()

For reference, I am also seeing this error on my Org files. Note that Orgnode is very simplistic. The most accurate and fast Org parser in the wild that I know of is https://github.com/tecosaur/Org.jl P.S. Your project is very promising :)

…

-- Ihor Radchenko, Org mode contributor, Learn more about Org mode at https://orgmode.org/. Support Org development at https://liberapay.com/org-mode, or support my work at https://liberapay.com/yantar92

debanjum commented on May 16, 2024

> Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

Ah, thanks for the additional details. I can reproduce both issues now.
To summarize, khoj is failing to handle two types of files:

1. Files with only entries with no body
2. Files with no entries

Here, an entry in org terminology is anything that starts with a heading, and a heading is anything that starts with *s.

Context:

1. khoj indexes and shows results at a per-entry level (similar to org-agenda search, org-rifle, etc.)
2. So khoj shouldn't fail when it sees files with no entries, but it will still ignore such files going forward
3. It only indexes entries with body text (i.e. entries that have something underneath the heading)
   This was done for quality-of-results reasons. Do you feel a need for khoj to index entries with only headings, or is it fine to ignore such entries? If indexing heading-only entries is needed, we can index them, but I'll add a filter to ignore entries with no body before we do.

As an immediate mitigation, I'll make khoj safely ignore the two cases instead of failing. Later, if needed, we can add more thought out solutions. Does that sound reasonable?
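A rough sketch of what "safely ignore" could look like, assuming entries are dicts with a `compiled` text field as seen in the first traceback. The helper names (`indexable_entries`, `compute_embeddings_safely`) and the `encode` callable are stand-ins for illustration, not khoj's real functions.

```python
def indexable_entries(entries):
    """Keep only entries that have some non-blank compiled text."""
    return [e for e in entries if e.get("compiled", "").strip()]


def compute_embeddings_safely(entries, encode):
    """Encode entries, or return None when there is nothing to encode.

    `encode` stands in for the model call (e.g. a bi-encoder's encode
    method). Returning early avoids handing the model an empty list,
    which is what led to "stack expects a non-empty TensorList".
    """
    entries = indexable_entries(entries)
    if not entries:
        return None
    return encode([e["compiled"] for e in entries])
```

Whether to return `None`, log a warning, or raise an error with a clear message when nothing is indexable is a design choice; the point is only that the empty case is decided before the model is called.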

yantar92 commented on May 16, 2024

Debanjum ***@***.***> writes:

> Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.
>
> Ah, thanks for the additional details. I can reproduce both issues now. To summarize `khoj` is failing to handle two type of files:
>
> 1. Files with only entries with no body
> 2. Files with no entries
>
> *Where an `entry` in `org` terminology is anything that starts with a `heading` and `heading` is anything that starts with `*`s*

Clarification: heading must start at bol and must have a space after "*": "^\\*+ ".
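For illustration, the same pattern written as a Python regex. In Python source the regex is `^\*+ ` (the doubled backslash above is Elisp string escaping). The helper name `is_heading` is made up for this sketch.

```python
import re

# One or more '*' at the beginning of the line, followed by a space.
HEADING_RE = re.compile(r"^\*+ ")


def is_heading(line):
    """True if `line` is an Org heading under the rule quoted above."""
    return HEADING_RE.match(line) is not None


for sample in ["* Top level", "*** Deeper", "*bold* text", "  * indented item", "- item", "1. item"]:
    print(repr(sample), is_heading(sample))
```

Under this rule, lines starting with `-`, `+`, or a number are plain list items rather than headings, and so is a `*` that is indented or not followed by a space.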

> 3. It only indexes entries with body text (i.e has something underneath the heading) This was done for quality of results reasons. Do you feel a need for `khoj` to index entries with only headings or is it fine to ignore such entries? If index heading only entries is needed, we can index them but I'll add a filter to ignore entries with no body filter before we do.

When Org is used for bookmark management, empty bodies are not uncommon. Also, do you consider property drawer as a part of body?

> As an immediate mitigation, I'll make `khoj` safely ignore the two cases instead of failing. Later, if needed, we can add more thought out solutions. Does that sound reasonable?

In the context of org-roam, files without headings may not be uncommon - the "entry" is then defined by #+TITLE keyword or something similar. Of course, not failing on Org files without headings is a good starting point.

…

debanjum commented on May 16, 2024

> on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?

Khoj doesn't currently support multiple input-filters, but I've created an issue to track adding that. There's no way to exclude files from a filter for now either, but maybe I'll resolve that when multiple input-filter support is added.

debanjum commented on May 16, 2024

> Clarification: heading must start at bol and must have a space after "*": "^\\*+ ".

Agreed, That's how it is handled in code. I was just trying to keep my definition less verbose :)

> Also, do you consider property drawer as a part of body?

Property drawers are not considered part of body for the purposes of indexing in khoj

> When Org is used for bookmark management, empty bodies are not uncommon.
> In the context of org-roam, files without headings may not be uncommon -
> the "entry" is then defined by #+TITLE keyword or something similar.

I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.

yantar92 commented on May 16, 2024

Debanjum ***@***.***> writes:

> I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.

See examples in https://github.com/yantar92/org-capture-ref

Bookmarks to this repo look like

    ***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
    :PROPERTIES:
    :TITLE:    debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
    :BTYPE:    misc
    :ID:       Github-debanjum-debanjum-khoj-natural-e4a
    :AUTHOR:   debanjum
    :CREATED:  [2022-09-07 Wed 21:51]
    :HOWPUBLISHED: Github
    :NOTE:     Online; accessed 07 September 2022
    :RSS:      https://github.com/debanjum/khoj/commits.atom
    :URL:      https://github.com/debanjum/khoj
    :END:
    :LOGBOOK:
    - Refiled on [2022-09-07 Wed 22:47]
    :END:

…

debanjum commented on May 16, 2024

> Note that Orgnode is very simplistic. The most accurate and fast Org
> parser in the wild that I know of is https://github.com/tecosaur/Org.jl

Yeah, OrgNode is very basic. I've modified it for khoj to handle more scenarios but it's pretty ad-hoc. Org.jl looks interesting. I'm also tracking Org-Parser as they're being more methodical about parsing org syntax

debanjum commented on May 16, 2024

> Bookmarks to this repo look like

    ***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
    :PROPERTIES:
    :TITLE:    debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
    :BTYPE:    misc
    :ID:       Github-debanjum-debanjum-khoj-natural-e4a
    :AUTHOR:   debanjum
    :CREATED:  [2022-09-07 Wed 21:51]
    :HOWPUBLISHED: Github
    :NOTE:     Online; accessed 07 September 2022
    :RSS:      https://github.com/debanjum/khoj/commits.atom
    :URL:      https://github.com/debanjum/khoj
    :END:
    :LOGBOOK:
    - Refiled on [2022-09-07 Wed 22:47]
    :END:

I see. This particular entry will happen to get indexed, since entries with logbook drawer notes get indexed. But I see the use-case for indexing entries with no body text in khoj. Will add support for it soon.
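As a sketch of the behaviour described in this thread (property drawers don't count as body, logbook notes do), the body could be derived roughly like this. This is an assumption about the effect, not khoj's actual implementation; `PROPERTY_DRAWER` and `body_for_indexing` are hypothetical names.

```python
import re

# Match a :PROPERTIES: ... :END: drawer, including its closing line.
PROPERTY_DRAWER = re.compile(
    r"^[ \t]*:PROPERTIES:[ \t]*\n.*?^[ \t]*:END:[ \t]*\n?",
    re.MULTILINE | re.DOTALL,
)


def body_for_indexing(raw_body):
    """Drop the property drawer; keep everything else (e.g. :LOGBOOK:)."""
    return PROPERTY_DRAWER.sub("", raw_body).strip()


bookmark_body = (
    ":PROPERTIES:\n"
    ":URL:      https://github.com/debanjum/khoj\n"
    ":END:\n"
    ":LOGBOOK:\n"
    "- Refiled on [2022-09-07 Wed 22:47]\n"
    ":END:\n"
)
print(body_for_indexing(bookmark_body))
```

For the bookmark above, what remains is the logbook drawer, which is why that entry ends up with a non-empty body. A bookmark with only a property drawer would end up with an empty body.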

yantar92 commented on May 16, 2024

Debanjum ***@***.***> writes:

> Yeah, OrgNode is very basic. I've modified it for `khoj` to handle more scenarios but it's pretty ad-hoc. `Org.jl` looks interesting. I'm also tracking [Org-Parser](https://github.com/200ok-ch/org-parser) as they're being more methodical about parsing org syntax

org-parser has major issues with performance scaling (200ok-ch/org-parser#56). Org.jl, on the other hand, has been developed by one of the core Org developers :) It is even faster than tree sitter Org syntax (https://github.com/milisims/tree-sitter-org). Yet another parser is https://github.com/tgbugs/laundry Of course, the basic headline parsing does not require all these fancy parsers.

…

debanjum commented on May 16, 2024

Ah, hadn't seen the org-parser perf concerns. The rest of the parsers info is very informative too. But yeah for khoj nothing too fancy is required (currently). The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

yantar92 commented on May 16, 2024

Debanjum ***@***.***> writes:

> ... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes. Such use-case is one of the two common paradigms to organize Org files (one large file vs. many small files aka org-roam). I am now trying to index my 20Mb notes.org file and the estimate says that the process will take over 1 hour to complete. Even if done in the background, such a long re-indexing will take forever to complete.

…

debanjum commented on May 16, 2024

Yes, indexing takes quite a while for larger data sets. Most of this is due to the model generating embeddings. And not the actual file parsing itself.

PR #75 is meant to make this long indexing time only required the first time (or whenever a large amount of new data is to be indexed). But for subsequent runs it'll only re-index new or modified entries. This should speed up updating the index significantly. Enough hopefully so that the index can be updated automatically in the background from within the app itself 🤞🏾
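The general idea of re-indexing only new or modified entries can be sketched as below. This is not the code from PR #75; it assumes each entry can be identified by a hash of its text, and `entry_key`, `update_index`, and the `embed` callable are names invented for the example.

```python
import hashlib


def entry_key(text):
    """Stable identifier for an entry, derived from its content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def update_index(index, entries, embed):
    """Return (new_index, number_of_entries_embedded).

    `index` maps entry_key -> embedding from the previous run.
    Unchanged entries reuse their stored embedding; only new or modified
    entries are passed to `embed` (the expensive model call). Entries
    that no longer exist are dropped from the new index.
    """
    new_index = {}
    embedded = 0
    for text in entries:
        key = entry_key(text)
        if key in new_index:
            continue  # duplicate text in this run, already handled
        if key in index:
            new_index[key] = index[key]
        else:
            new_index[key] = embed(text)
            embedded += 1
    return new_index, embedded
```

On a first run every entry is embedded; on later runs the cost scales with the number of changed entries rather than the size of the whole corpus.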

debanjum commented on May 16, 2024

> > ... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.
>
> This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes.

Note: Currently the index has to be manually updated by the user (by calling the /regenerate API endpoint). The user should not expect khoj to search on the latest modified notes but on the last indexed notes. Even once automatic indexing is implemented, the index will lag the latest state of notes.

This shouldn't impact most practical use-cases IMO, as you're usually searching for older entries that you don't recall, not the latest edits to notes you may have just made.

debanjum commented on May 16, 2024

@emacsomancer, @yantar92 I've merged fixes for the 2 main issues found on this thread to master. khoj should now:

- Parse org files with no headings
- Throw an error (with an appropriate message) if no valid entries are found

It'd be great if you can try the merged changes by upgrading to a pre-release build of khoj with:

pip install --upgrade --pre khoj-assistant

Let me know if this hasn't fixed the above issues for you all

debanjum commented on May 16, 2024

That's good to know! Thanks for verifying

emacsomancer commented on May 16, 2024

On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 112, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 83, in makelist
    for line in f:
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 676: invalid start byte
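Since the traceback doesn't name the offending file, one way to track it down is to try decoding each candidate file separately. The sketch below is a standalone helper, not part of khoj; `read_org_file` and `find_undecodable` are hypothetical names, and it assumes the files are expected to be UTF-8.

```python
from pathlib import Path


def read_org_file(path, encoding="utf-8"):
    """Read a file, re-raising decode failures with the file path attached."""
    try:
        return Path(path).read_text(encoding=encoding)
    except UnicodeDecodeError as err:
        raise ValueError(
            f"{path}: not valid {encoding} ({err.reason} at byte {err.start})"
        ) from err


def find_undecodable(paths, encoding="utf-8"):
    """Return the subset of `paths` that cannot be decoded with `encoding`."""
    bad = []
    for path in paths:
        try:
            Path(path).read_text(encoding=encoding)
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Running `find_undecodable` over the files matched by the input-filter should point at the file containing byte 0xa2, which can then be re-saved as UTF-8 or excluded.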

debanjum commented on May 16, 2024

> On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

Thanks for validating! Good to know that the initial issue is resolved. I'll push an update to index header only entries soon.

> But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

And thanks for discovering another bug! :) Could you please open a separate GitHub issue to track this new error? It'll make it easier to track separate bugs (and fixes) for future reference.
