Comments (21)
Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.
One, a file with *-headings, but only *-headers (nothing "underneath" any of the headings):
Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 181, in setup
    corpus_embeddings = compute_embeddings(entries, bi_encoder, config.embeddings_file, regenerate=regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 66, in compute_embeddings
    corpus_embeddings = bi_encoder.encode([entry['compiled'] for entry in entries], convert_to_tensor=True, device=state.device, show_progress_bar=True)
  File "/home/slade/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
    all_embeddings = torch.stack(all_embeddings)
RuntimeError: stack expects a non-empty TensorList
Two, a file with no *-headings at all (but which is still a valid .org file):
Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 108, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 170, in makelist
    thisNode = Orgnode(level, heading, bodytext, tags)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 217, in __init__
    self.level = len(level)
TypeError: object of type 'int' has no len()
I can provide specific examples of such files if necessary.
from khoj.
The error is gone on my side.
from khoj.
It doesn't seem to be an issue with input-filter. More like an issue in parsing (some of?) your org file(s) by the OrgNode parser. The level argument is expected to be a string of *s seen at the start of a heading instead of an int.
Are you seeing this issue even when you only set the input-files field in the khoj.yml?
Next Steps:
- Let me try to reproduce the issue on my end to see what's causing it
- If you can share a test file or snippet that could be causing the failure, it'll speed up the fix for this issue
- Until then, you can try to bypass the issue by identifying and excluding the file(s) that may be causing the parsing error
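To illustrate the level mismatch described above, here is a minimal sketch. The `Node` class and `parse_line` helper are hypothetical stand-ins, not the actual Orgnode code; only the `len(level)` call mirrors the traceback. The idea that an int such as 0 is passed for text before any heading is an assumption about how the error arises.

```python
import re

# Hypothetical stand-in for the parser's node class; only the `level`
# handling mirrors the traceback in this thread.
class Node:
    def __init__(self, level, heading):
        # Expects `level` to be the run of leading asterisks, e.g. "**".
        self.level = len(level)
        self.heading = heading

# A heading is one or more asterisks at the start of the line, then a space.
HEADING = re.compile(r"^(\*+) (.*)$")

def parse_line(line):
    """Return a Node for a heading line, or None for non-heading text."""
    match = HEADING.match(line)
    if match is None:
        return None
    return Node(match.group(1), match.group(2))

node = parse_line("** Second-level heading")
print(node.level)  # 2

# Assumption: an int (e.g. a default level of 0) is passed when the file has
# no headings. That reproduces the TypeError from the second traceback.
try:
    Node(0, "text before the first heading")
except TypeError as err:
    print(err)  # object of type 'int' has no len()
```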
from khoj.
But what about numbered lists and -, + type headings?
(Additional: on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?)
from khoj.
Ok, I think I've narrowed it down to particular types of files, for which I can get at least two different types of errors.
Ah, thanks for the additional details. I can reproduce both issues now.
To summarize, khoj is failing to handle two types of files:
- Files with only entries with no body
- Files with no entries
Where an entry in org terminology is anything that starts with a heading, and a heading is anything that starts with *s.
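The two failing cases can be sketched as follows. This is a simplified illustration, not khoj's parser: `Entry`, `split_entries`, and `classify` are hypothetical names, and everything under a heading (including drawers) is treated as body here, which is cruder than what the thread later describes.

```python
import re
from typing import List, NamedTuple

class Entry(NamedTuple):
    level: int
    heading: str
    body: str

# A heading is one or more asterisks at the start of the line, then a space.
HEADING = re.compile(r"^(\*+) (.*)$")

def split_entries(text: str) -> List[Entry]:
    """Split org text into entries; text before the first heading is skipped."""
    entries: List[Entry] = []
    current = None  # (level, heading, body_lines) of the entry being built
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if current is not None:
                entries.append(Entry(current[0], current[1], "\n".join(current[2]).strip()))
            current = (len(match.group(1)), match.group(2), [])
        elif current is not None:
            current[2].append(line)
    if current is not None:
        entries.append(Entry(current[0], current[1], "\n".join(current[2]).strip()))
    return entries

def classify(text: str) -> str:
    """Label a file according to the two problem cases from the summary."""
    entries = split_entries(text)
    if not entries:
        return "no entries"
    if all(not entry.body for entry in entries):
        return "only entries with no body"
    return "has entries with body"
```

For example, `classify("* A\n* B")` falls into the first problem case and `classify("#+TITLE: notes\nplain text")` into the second.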
Context:
- khoj indexes and shows results at a per-entry level (similar to org-agenda search, org-rifle etc.). So khoj shouldn't fail when it sees files with no entries, but it'll still ignore such files going forward
- It only indexes entries with body text (i.e. entries that have something underneath the heading). This was done for quality of results reasons. Do you feel a need for khoj to index entries with only headings, or is it fine to ignore such entries? If indexing heading-only entries is needed, we can index them, but I'll add a filter to ignore entries with no body before we do.
As an immediate mitigation, I'll make khoj safely ignore the two cases instead of failing. Later, if needed, we can add more thought-out solutions. Does that sound reasonable?
from khoj.
on excluding files and so on: can there be multiple filters? (for files in different locations) and is there a way to exclude files from a filter?
Khoj doesn't currently support multiple input-filters, but I've created an issue to track adding that. There's no way to exclude files from a filter for now either, but maybe I'll resolve it when the multiple input-filter support is added.
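Until such support exists, one possible workaround is to expand several glob patterns yourself, drop the unwanted files, and supply the resulting explicit paths instead of a single filter. The sketch below is an assumption about how that could be done outside khoj; `expand_filters` is a hypothetical helper and the patterns in the comment are placeholders.

```python
import fnmatch
import glob
import os
from typing import Iterable, List

def expand_filters(include: Iterable[str], exclude: Iterable[str] = ()) -> List[str]:
    """Expand several include globs and drop paths matching any exclude pattern."""
    files = set()
    for pattern in include:
        # recursive=True lets "**" match nested directories.
        files.update(glob.glob(os.path.expanduser(pattern), recursive=True))
    exclude = list(exclude)
    return sorted(
        path for path in files
        if not any(fnmatch.fnmatch(path, pattern) for pattern in exclude)
    )

# Hypothetical usage (placeholder locations):
# expand_filters(["~/notes/*.org", "~/work/**/*.org"], exclude=["*archive.org"])
```

The returned list could then be pasted into the input-files field, assuming that field accepts explicit file paths.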
from khoj.
Clarification: heading must start at bol and must have a space after "*": "^\*+ ".
Agreed, That's how it is handled in code. I was just trying to keep my definition less verbose :)
Also, do you consider property drawer as a part of body?
Property drawers are not considered part of the body for the purposes of indexing in khoj.
When Org is used for bookmark management, empty bodies are not uncommon.
In the context of org-roam, files without headings may not be uncommon -
the "entry" is then defined by #+TITLE keyword or something similar.
I see. Can you clarify the bookmark management scenario a bit more? Seems like there is a use-case for handling headings with empty bodies. So I can add an issue to track that change.
from khoj.
Note that Orgnode is very simplistic. The most accurate and fast Org
parser in the wild that I know of is https://github.com/tecosaur/Org.jl
Yeah, OrgNode is very basic. I've modified it for khoj to handle more scenarios but it's pretty ad-hoc. Org.jl looks interesting. I'm also tracking Org-Parser as they're being more methodical about parsing org syntax.
from khoj.
Bookmarks to this repo looks like
***** SOMEDAY [#A] debanjum [Github] debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos :BOOKMARK:FLAGGED:misc:SOMEDAY:
:PROPERTIES:
:TITLE: debanjum/khoj: Natural Language Search Engine for your Org-Mode and Markdown notes, Beancount transactions and Photos
:BTYPE: misc
:ID: Github-debanjum-debanjum-khoj-natural-e4a
:AUTHOR: debanjum
:CREATED: [2022-09-07 Wed 21:51]
:HOWPUBLISHED: Github
:NOTE: Online; accessed 07 September 2022
:RSS: https://github.com/debanjum/khoj/commits.atom
:URL: https://github.com/debanjum/khoj
:END:
:LOGBOOK:
- Refiled on [2022-09-07 Wed 22:47]
:END:
I see. This will happen to get indexed, as entries with logbook drawer notes get indexed. But I see the use-case for indexing entries with no body text in khoj. Will add support for it soon.
from khoj.
Ah, hadn't seen the org-parser perf concerns. The rest of the parsers info is very informative too. But yeah, for khoj nothing too fancy is required (currently). The parsing required to create the index is expected to be done in the background, so speed should be less of a concern.
from khoj.
Yes, indexing takes quite a while for larger data sets. Most of this is due to the model generating embeddings, not the actual file parsing itself.
PR #75 is meant to make this long indexing time only required the first time (or whenever a large amount of new data is to be indexed). But for subsequent runs it'll only re-index new or modified entries. This should speed up updating the index significantly. Enough hopefully so that the index can be updated automatically in the background from within the app itself 🤞🏾
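One common way to re-index only new or modified entries is to key each entry by a hash of its text and reuse stored embeddings for unchanged keys. The sketch below shows that general idea only; it is not taken from PR #75, and `entry_key`, `update_index`, and the `embed` callable are hypothetical names.

```python
import hashlib
from typing import Any, Callable, Dict, List, Sequence, Tuple

def entry_key(text: str) -> str:
    """Stable identifier for an entry, derived from its content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def update_index(
    entries: Sequence[str],
    cache: Dict[str, Any],
    embed: Callable[[List[str]], List[Any]],
) -> Tuple[Dict[str, Any], int]:
    """Return (new cache, number of entries embedded in this run).

    Unchanged entries reuse cached embeddings, new or modified entries are
    embedded, and entries that no longer exist drop out of the new cache.
    """
    keys = [entry_key(text) for text in entries]
    missing = {key: text for key, text in zip(keys, entries) if key not in cache}
    fresh: Dict[str, Any] = {}
    if missing:
        # The expensive model call only sees entries without a cached embedding.
        vectors = embed(list(missing.values()))
        fresh = dict(zip(missing.keys(), vectors))
    new_cache = {key: cache[key] if key in cache else fresh[key] for key in keys}
    return new_cache, len(missing)
```

With this scheme the first run embeds everything, while later runs only pay for entries whose text changed.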
from khoj.
... The parsing required to create index is expected to be done in the background, so speed should be less of a concern.
This may be problematic if there is a single large Org file and user makes changes to it followed by searching those changes.
Note: Currently the index has to be manually updated by the user (by calling the /regenerate API endpoint). The user should not expect khoj to search on the latest modified notes but on the last indexed notes. Even once automatic indexing is implemented, the index will lag the latest state of notes.
This shouldn't impact most practical use-cases IMO, as you're usually searching for older entries that you don't recall, not the latest edits to notes you may have just made.
from khoj.
@emacsomancer, @yantar92 I've merged fixes for the 2 main issues found on this thread to master. khoj should now:
- Parse org files with no headings
- Throw error (with appropriate message) if no valid entries found
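The second point can be sketched as a guard placed before the encoder is called, so an empty entry list produces a readable error instead of the "stack expects a non-empty TensorList" failure from the first traceback. This is an illustrative sketch, not the merged fix; `compute_embeddings_safely` and the `encode` callable are hypothetical, while the `compiled` key follows the traceback.

```python
from typing import Any, Callable, Dict, List, Sequence

def compute_embeddings_safely(
    entries: Sequence[Dict[str, str]],
    encode: Callable[[List[str]], Any],
) -> Any:
    """Encode the compiled text of each entry, failing early if there is none."""
    texts = [entry["compiled"] for entry in entries]
    if not texts:
        # Fail with a clear message rather than letting the encoder
        # attempt to stack an empty list of tensors.
        raise ValueError("No valid entries found in the specified files; nothing to index.")
    return encode(texts)
```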
It'd be great if you can try the merged changes by upgrading to a pre-release build of khoj with:
pip install --upgrade --pre khoj-assistant
Let me know if this hasn't fixed the above issues for you all
from khoj.
That's good to know! Thanks for verifying
from khoj.
On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).
But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):
Traceback (most recent call last):
  File "/home/slade/.local/bin/khoj", line 8, in <module>
    sys.exit(run())
  File "/home/slade/.local/lib/python3.10/site-packages/src/main.py", line 112, in run
    configure_server(args, required=False)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 36, in configure_server
    state.model = configure_search(state.model, state.config, args.regenerate)
  File "/home/slade/.local/lib/python3.10/site-packages/src/configure.py", line 46, in configure_search
    model.orgmode_search = text_search.setup(
  File "/home/slade/.local/lib/python3.10/site-packages/src/search_type/text_search.py", line 173, in setup
    text_to_jsonl(config.input_files, config.input_filter, config.compressed_jsonl)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 32, in org_to_jsonl
    entries, file_to_entries = extract_org_entries(org_files)
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/org_to_jsonl.py", line 72, in extract_org_entries
    org_file_entries = orgnode.makelist(str(org_file))
  File "/home/slade/.local/lib/python3.10/site-packages/src/processor/org_mode/orgnode.py", line 83, in makelist
    for line in f:
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 676: invalid start byte
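Since the traceback does not name the offending file, a small standalone script can locate it by trying to decode each candidate file as UTF-8. This is a sketch independent of khoj; `find_undecodable` is a hypothetical helper and the directory in the usage comment is a placeholder.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union

def find_undecodable(
    paths: Iterable[Union[str, Path]], encoding: str = "utf-8"
) -> List[Tuple[str, int, str]]:
    """Return (path, byte offset, reason) for each file that fails to decode."""
    problems = []
    for path in paths:
        data = Path(path).read_bytes()
        try:
            data.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte.
            problems.append((str(path), err.start, err.reason))
    return problems

# Hypothetical usage (placeholder directory):
# for path, offset, reason in find_undecodable(Path("~/notes").expanduser().glob("**/*.org")):
#     print(f"{path}: byte offset {offset}: {reason}")
```

Any file it reports is likely saved in a different encoding (for example Latin-1) and could be excluded or converted to UTF-8 before indexing.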
from khoj.
On the two files I had tested previously, now khoj runs without errors (though it still doesn't seem to actually index headers with no bodies, but expected given #87).
Thanks for validating! Good to know that the initial issue is resolved. I'll push an update to index header only entries soon.
But when I try running with the input-filter on a larger set of files, I am now encountering a new error (not sure what file is triggering it, as that doesn't appear as part of the error output):
And thanks for discovering another bug! :) Could you please open a separate GitHub issue to track this new error? It'll make it easier to track separate bugs (and fixes) for future reference.
from khoj.