
Comments (21)

emacsomancer avatar emacsomancer commented on May 16, 2024 1

Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

One, a file with *-headings, but only *-headers (nothing "underneath" any of the headings):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 181, in setup
    corpus_embeddings = compute_embeddings(entries, bi_encoder, config.embeddings_file, regenerate=regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 66, in compute_embeddings
    corpus_embeddings = bi_encoder.encode([entry['compiled'] for entry in entries], convert_to_tensor=True, device=state.device, show_progress_bar=True)
  File "/home/slade/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
    all_embeddings = torch.stack(all_embeddings)
RuntimeError: stack expects a non-empty TensorList

Two, a file with no *-headings at all (but which is still a valid .org file):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()

I can provide specific examples of such files if necessary.

from khoj.

yantar92 avatar yantar92 commented on May 16, 2024 1

The error is gone on my side.


debanjum avatar debanjum commented on May 16, 2024

It doesn't seem to be an issue with input-filter. It looks more like an issue with the OrgNode parser parsing (some of?) your org file(s). The level argument is expected to be a string of *s seen at the start of a heading, not an int.

Are you seeing this issue even when you only set the input-files field in the khoj.yml?

Next Steps:

  • Let me try to reproduce the issue on my end to see what's causing it
  • If you can share a test file or snippet that could be causing the failure, it'll speed up the fix for this issue.
  • Until then, you can try to bypass the issue by identifying and excluding the file(s) that may be causing the parsing error


emacsomancer avatar emacsomancer commented on May 16, 2024

But what about numbered lists and -, + type headings?

(Additional: on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?)


yantar92 avatar yantar92 commented on May 16, 2024


debanjum avatar debanjum commented on May 16, 2024

Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.

Ah, thanks for the additional details. I can reproduce both issues now.
To summarize, khoj is failing to handle two types of files:

  1. Files with only entries with no body
  2. Files with no entries

Where an entry, in org terminology, is anything that starts with a heading, and a heading is anything that starts with *s.

Context:

  1. khoj indexes and shows results at a per entry level (similar to org-agenda search, org-rifle etc)
    So khoj shouldn't fail when it sees files with no entries, but it'll still ignore such files going forward
  2. It only indexes entries with body text (i.e. entries that have something underneath the heading)
    This was done for quality-of-results reasons. Do you feel a need for khoj to index entries with only headings, or is it fine to ignore such entries? If indexing heading-only entries is needed, we can index them, but I'll add a filter to ignore entries with no body before we do.

As an immediate mitigation, I'll make khoj safely ignore the two cases instead of failing. Later, if needed, we can add more thought out solutions. Does that sound reasonable?
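
For illustration, the two cases could be handled roughly as below. This is only a minimal sketch, not khoj's actual parser: `parse_entries` and `indexable_entries` are hypothetical names, and the pattern assumes a heading is one or more *s at the beginning of a line followed by a space.

```python
import re

# Assumption: a heading is one or more '*' at the start of a line, then a space.
HEADING = re.compile(r"^(\*+) (.*)$")


def parse_entries(text):
    """Split org text into entries, each a dict with level, heading and body."""
    entries = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # The level is the number of leading stars.
            current = {
                "level": len(match.group(1)),
                "heading": match.group(2).strip(),
                "body": [],
            }
            entries.append(current)
        elif current is not None:
            current["body"].append(line)
        # Lines before the first heading belong to no entry and are skipped.
    for entry in entries:
        entry["body"] = "\n".join(entry["body"]).strip()
    return entries


def indexable_entries(text):
    """Keep only entries that have some body text under the heading."""
    return [entry for entry in parse_entries(text) if entry["body"]]
```

With this sketch, a file with no headings yields an empty list rather than an exception, and heading-only entries are filtered out before anything reaches the embedding step.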


yantar92 avatar yantar92 commented on May 16, 2024


debanjum avatar debanjum commented on May 16, 2024

on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?

Khoj doesn't currently support multiple input-filters, but I've created an issue to track adding that. There's no way to exclude files from a filter for now either, but maybe I'll resolve that when multiple input-filter support is added.


debanjum avatar debanjum commented on May 16, 2024

Clarification: heading must start at bol and must have a space after "*": "^\*+ ".

Agreed, that's how it is handled in the code. I was just trying to keep my definition less verbose :)

Also, do you consider property drawer as a part of body?

Property drawers are not considered part of body* for the purposes of indexing in khoj

When Org is used for bookmark management, empty bodies are not uncommon.
In the context of org-roam, files without headings may not be uncommon -
the "entry" is then defined by #+TITLE keyword or something similar.

I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.


yantar92 avatar yantar92 commented on May 16, 2024


debanjum avatar debanjum commented on May 16, 2024

Note that Orgnode is very simplistic. The most accurate and fast Org
parser in the wild that I know of is https://github.com/tecosaur/Org.jl

Yeah, OrgNode is very basic. I've modified it for khoj to handle more scenarios but it's pretty ad-hoc. Org.jl looks interesting. I'm also tracking Org-Parser as they're being more methodical about parsing org syntax


debanjum avatar debanjum commented on May 16, 2024

Bookmarks to this repo look like

    ***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
    :PROPERTIES:
    :TITLE:    debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
    :BTYPE:    misc
    :ID:       Github-debanjum-debanjum-khoj-natural-e4a
    :AUTHOR:   debanjum
    :CREATED:  [2022-09-07 Wed 21:51]
    :HOWPUBLISHED: Github
    :NOTE:     Online; accessed 07 September 2022
    :RSS:      https://github.com/debanjum/khoj/commits.atom
    :URL:      https://github.com/debanjum/khoj
    :END:
    :LOGBOOK:
    - Refiled on [2022-09-07 Wed 22:47]
    :END:

I see. This will happen to get indexed as entries with logbook drawer notes get indexed. But I see the use-case for indexing entries with no body text in khoj. Will add support for it soon


yantar92 avatar yantar92 commented on May 16, 2024


debanjum avatar debanjum commented on May 16, 2024

Ah, hadn't seen the org-parser perf concerns. The rest of the parsers info is very informative too. But yeah for khoj nothing too fancy is required (currently). The parsing required to create index is expected to be done in the background, so speed should be less of a concern.


yantar92 avatar yantar92 commented on May 16, 2024


debanjum avatar debanjum commented on May 16, 2024

Yes, indexing takes quite a while for larger data sets. Most of this is due to the model generating embeddings, not the actual file parsing.

PR #75 is meant to make this long indexing time necessary only the first time (or whenever a large amount of new data is to be indexed). On subsequent runs it'll only re-index new or modified entries. This should speed up updating the index significantly. Enough, hopefully, that the index can be updated automatically in the background from within the app itself 🤞🏾
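
To make the idea concrete, here is a rough sketch of re-indexing only new or modified entries. It is not the code in PR #75: `entry_key`, `update_index` and the `embed` callable are hypothetical names, and fingerprinting each entry's text with a hash is just one way to detect changes.

```python
import hashlib


def entry_key(entry_text):
    """Stable fingerprint of an entry's text (assumption: hash of the content)."""
    return hashlib.sha256(entry_text.encode("utf-8")).hexdigest()


def update_index(previous, entries, embed):
    """Return a new {key: embedding} index, embedding only unseen entries.

    previous: dict mapping entry key -> embedding from the last run
    entries:  list of entry texts found in the current files
    embed:    callable computing an embedding for one entry text
              (hypothetical stand-in for the model)
    """
    updated = {}
    for text in entries:
        key = entry_key(text)
        if key in previous:
            # Unchanged entry: reuse the embedding computed earlier.
            updated[key] = previous[key]
        else:
            # New or modified entry: this is the only expensive call.
            updated[key] = embed(text)
    # Entries that no longer exist are dropped, since they are never copied over.
    return updated
```

The expensive model call then runs once per changed entry instead of once per entry in the corpus.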


debanjum avatar debanjum commented on May 16, 2024

... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.

This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes.

Note: Currently the index has to be manually updated by the user (by calling the /regenerate API endpoint). The user should not expect khoj to search on the latest modified notes but on the last indexed notes. Even once automatic indexing is implemented, the index will lag the latest state of notes.

This shouldn't impact most practical use-cases IMO, as you're usually searching for older entries that you don't recall, not the latest edits to notes you may have just made.


debanjum avatar debanjum commented on May 16, 2024

@emacsomancer, @yantar92 I've merged fixes for the 2 main issues found on this thread to master. khoj should now:

  1. Parse org files with no headings
  2. Throw an error (with an appropriate message) if no valid entries are found
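
As a rough illustration of the second point (not the merged code itself), a guard like the one below could run before the entries are passed to the encoder. `require_entries` and its `source` parameter are hypothetical names; the `compiled` key is the one visible in the traceback earlier in this thread.

```python
def require_entries(entries, source="the configured org files"):
    """Fail early with a clear message instead of encoding an empty list."""
    # Keep only entries whose compiled text is non-empty.
    valid = [entry for entry in entries if entry.get("compiled", "").strip()]
    if not valid:
        raise ValueError(f"No valid entries found in {source}; nothing to index.")
    return valid
```

Failing here avoids the less helpful "stack expects a non-empty TensorList" error further down the call chain.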

It'd be great if you can try the merged changes by upgrading to a pre-release build of khoj with:

pip install --upgrade --pre khoj-assistant

Let me know if this hasn't fixed the above issues for you all


debanjum avatar debanjum commented on May 16, 2024

That's good to know! Thanks for verifying


emacsomancer avatar emacsomancer commented on May 16, 2024

On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 112, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 83, in makelist
    for line in f:
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 676: invalid start byte
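
Since the traceback doesn't name the offending file, one way to track it down is to try decoding each candidate file separately. This is only a sketch using the Python standard library; `find_undecodable` is a hypothetical helper, not part of khoj.

```python
from pathlib import Path


def find_undecodable(paths, encoding="utf-8"):
    """Return (path, error) pairs for files that fail to decode."""
    failures = []
    for path in paths:
        try:
            # Read raw bytes and decode explicitly so the failing file is known.
            Path(path).read_bytes().decode(encoding)
        except UnicodeDecodeError as err:
            failures.append((str(path), err))
    return failures
```

It could be called on whatever files the input-filter matches, for example `find_undecodable(Path("notes").rglob("*.org"))`, where `notes` is a placeholder directory.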


debanjum avatar debanjum commented on May 16, 2024

On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).

Thanks for validating! Good to know that the initial issue is resolved. I'll push an update to index header only entries soon.

But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):

And thanks for discovering another bug! :) Could you please open a separate GitHub issue to track this new error? It'll make it easier to track separate bugs (and fixes) for future reference.
