mfarragher / obsidiantools

Obsidian tools - a Python package for analysing an Obsidian.md vault

License: Other

Python 100.00%

data-science knowledge-management network-analysis note-taking obsidian-community obsidian-md python

obsidiantools's Introduction

👋 Hello

I'm a data professional and open-source developer with expertise in healthcare, economics and statistical inference. I have experience with data science projects in sectors such as e-commerce, education & healthcare. 👛📚🩺 I'm currently a Digital Analytics Manager at Utility Warehouse (⚡) and recently completed my master's degree in MPhil Population Health Sciences (Health Data Science stream) at the University of Cambridge. 🎓

These are data science packages and apps I've developed:

appelpy : Python package for easier regression modelling
obsidiantools : Python package for analysing Obsidian.md vaults

I enjoy speaking about data, writing about data projects and sharing resources about data! ⚗

obsidiantools's Issues

[FR] Options : choose to use file name / frontmatter title for graph

I noticed that the generated graph uses the file path, and I would like to choose the frontmatter title or the file name instead.
How can I do that?

Graphic reference :

Generated using pyvis

get_md_relpaths_from_dir() globbing issue

Hello. Thanks for this library

I have tried this with 3.9 as suggested. But I'm getting an error straight away on gather.

File "/home/steve/.pyenv/versions/obsidian-python3.9.0/lib/python3.9/site-packages/obsidiantools/md_utils.py", line 29, in get_md_relpaths_from_dir
    for p in glob(str(dir_path / '**/*.md'), recursive=True)]
TypeError: unsupported operand type(s) for /: 'str' and 'str'

I have to change the / to a + in that line of get_md_relpaths_from_dir() to get the globbing working!

Change from

return [Path(p).relative_to(dir_path)
        for p in glob(str(dir_path / '**/*.md'), recursive=True)]

to

return [Path(p).relative_to(dir_path)
for p in glob(str(dir_path + '**/*.md'), recursive=True)]
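
Note that `dir_path + '**/*.md'` only matches when `dir_path` already ends with a path separator. A minimal sketch of an alternative, assuming the function is free to coerce its argument to `pathlib.Path` first (the name `md_relpaths` is hypothetical and is not the library's API):

```python
from pathlib import Path


def md_relpaths(dir_path):
    """Return the paths of all .md files under dir_path, relative to it."""
    # Coerce first so that both str and Path arguments are accepted
    root = Path(dir_path)
    return sorted(p.relative_to(root) for p in root.glob('**/*.md'))
```

With this shape the caller can pass either `"/path/to/vault"` or `Path("/path/to/vault")` and get the same list back.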

Detecting tabs and returns

After using the 'get_source_text' function, it seems the tabs and returns in the file are ignored. Is there a way to detect where they were?

Crash on _get_wikilinks_index

Hey, I'm not sure if this is related to any existing issue, maybe #18 ?

Traceback (most recent call last):
  File "semantic/semantic_search_api.py", line 42, in <module>
    vault = otools.Vault(wkd).connect().gather()
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/api.py", line 228, in connect
    wiki_link_map = self._get_wikilinks_index()
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/api.py", line 508, in _get_wikilinks_index
    return {k: get_wikilinks(self._dirpath / v)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/api.py", line 508, in <dictcomp>
    return {k: get_wikilinks(self._dirpath / v)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/md_utils.py", line 106, in get_wikilinks
    src_txt = _get_source_text_from_md_file(filepath, remove_code=True)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/md_utils.py", line 288, in _get_source_text_from_md_file
    html = _get_html_from_md_file(
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/md_utils.py", line 268, in _get_html_from_md_file
    _, content = _get_md_front_matter_and_content(
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/obsidiantools/md_utils.py", line 249, in _get_md_front_matter_and_content
    return frontmatter.parse(file_string)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/frontmatter/__init__.py", line 82, in parse
    fm = handler.load(fm)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/frontmatter/default_handlers.py", line 238, in load
    return yaml.load(fm, **kwargs)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/composer.py", line 133, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/composer.py", line 82, in compose_node
    node = self.compose_sequence_node(anchor)
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/composer.py", line 110, in compose_sequence_node
    while not self.check_event(SequenceEndEvent):
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/Users/louisbeaumont/Documents/brain/.obsidian/plugins/obsidian-ava/env/lib/python3.8/site-packages/yaml/parser.py", line 483, in parse_flow_sequence_entry
    raise ParserError("while parsing a flow sequence", self.marks[-1],
yaml.parser.ParserError: while parsing a flow sequence
  in "<unicode string>", line 3, column 10:
    aliases: [researchgate.net,(PDF) the Capa ... 
             ^
expected ',' or ']', but got '?'
  in "<unicode string>", line 3, column 98:
     ...  There Any Limit to Human Memory?]

get_wikilink_counts() not implemented

Hi @mfarragher I just noticed that get_wikilink_counts() has not been implemented. The current demo suggests that functionality for wikilinks would be consistent with functionality for backlinks. I hope that's useful feedback.
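
As an illustration of what such a counterpart could return, here is a minimal sketch. It assumes the input is a mapping from each note name to the list of all wikilinks found in it; the function name `wikilink_counts` is hypothetical and is not part of obsidiantools.

```python
from collections import Counter


def wikilink_counts(wikilinks_index):
    """Map each note to a Counter of its outgoing wikilinks."""
    # Counter keeps repeated links, so a note linked twice is counted twice
    return {note: Counter(links) for note, links in wikilinks_index.items()}
```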

[FR] support python3.8

Netlify, Vercel and the default Ubuntu 20.04 install all use Python 3.8, so it would be nice to support Python 3.8, since it is a widely used version.

Incremental refresh

Hey, for https://github.com/louis030195/obsidian-ava, I'm trying to implement an incremental refresh of the state of the vault.

Concretely, I build sentence embeddings of the whole vault and would like to re-compute embeddings every time a note is updated/deleted/created.

Do you see any way of doing this incrementally rather than reloading the vault and recomputing everything every time? (It takes ~1 min on mps device on my 500k words vault)

Ideally, I'd see maybe an API that let me listen to vault changes with callback(s) in this library?

Thanks 🚀😃
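
One way to approach this outside the library is to compare file modification times between runs and recompute embeddings only for the notes that changed. The sketch below uses the standard library only; `snapshot` and `diff_snapshots` are hypothetical helper names, and nothing here is an obsidiantools API.

```python
from pathlib import Path


def snapshot(vault_dir):
    """Map each .md file (as a relative path) to its last-modified time."""
    root = Path(vault_dir)
    return {p.relative_to(root): p.stat().st_mtime
            for p in root.glob('**/*.md')}


def diff_snapshots(old, new):
    """Return (created, modified, deleted) sets of relative paths."""
    created = set(new) - set(old)
    deleted = set(old) - set(new)
    # A note present in both snapshots changed if its mtime differs
    modified = {p for p in set(old) & set(new) if old[p] != new[p]}
    return created, modified, deleted
```

A caller would keep the previous snapshot, take a new one on each refresh, and re-embed only the `created` and `modified` notes while dropping the `deleted` ones.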

`TypeError: 'NoneType' object is not iterable` (in `_remove_front_matter`)

Running the following on my vault:

from pathlib import Path

import obsidiantools.api as ot
vault = ot.Vault(Path("/path/to/a/vault")).connect()

Results in:

# --->8--- Irrelevant frames omitted --->8---

~/.cache/pypoetry/virtualenvs/knowledgebase-scripts-Fe_uWe_V-py3.9/lib/python3.9/site-packages/obsidiantools/md_utils.py in _get_ascii_plaintext_from_md_file(filepath)
    190     html = _get_html_from_md_file(filepath)
    191     # strip out front matter (if any):
--> 192     html = _remove_front_matter(html)
    193     return _get_ascii_plaintext_from_html(html)
    194 

~/.cache/pypoetry/virtualenvs/knowledgebase-scripts-Fe_uWe_V-py3.9/lib/python3.9/site-packages/obsidiantools/md_utils.py in _remove_front_matter(html)
    201     if hr_content:
    202         # wipe out content from first hr (the front matter)
--> 203         for fm_detail in hr_content.find_next("p"):
    204             fm_detail.extract()
    205         # then wipe all hr elements

TypeError: 'NoneType' object is not iterable

A quick roundtrip in a debugger shows this happens with at least:

Notes containing an hr (---) but no YAML frontmatter.
Notes containing only frontmatter, no body.
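
Both cases can be told apart on the raw text, before any HTML conversion. The following is a minimal sketch, assuming front matter counts only when the file opens with a `---` line and a later `---` line closes it; `split_front_matter` is a hypothetical helper, not the library's function.

```python
def split_front_matter(text):
    """Return (front_matter, body); front_matter is None when absent."""
    lines = text.splitlines(keepends=True)
    # Front matter must open on the very first line of the file
    if not lines or lines[0].strip() != '---':
        return None, text
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == '---':
            return ''.join(lines[1:i]), ''.join(lines[i + 1:])
    # No closing delimiter was found: treat the whole file as body
    return None, text
```

A note that starts with a horizontal rule and contains a second one later would still be read as front matter here, so this only covers the two cases listed above.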

performance - file opens & reads

Hi.

Every markdown file is being opened & read a total of 8 times in normal connect & gather flow.
Might make sense to model a note as a class and have it load its own data once.
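
A minimal sketch of that idea, using `functools.cached_property` (Python 3.8+) so the file is read once per note object. The `Note` class and the simplified `WIKILINK_RE` pattern are illustrative assumptions and do not reflect how obsidiantools is implemented.

```python
import re
from functools import cached_property
from pathlib import Path

# Simplified, hypothetical pattern: captures the note name before any '|' or '#'
WIKILINK_RE = re.compile(r'\[\[([^\]|#]+)')


class Note:
    def __init__(self, path):
        self.path = Path(path)

    @cached_property
    def text(self):
        # Runs once per instance; later accesses reuse the stored string
        return self.path.read_text(encoding='utf-8')

    @cached_property
    def wikilinks(self):
        # Derived from the cached text, so the file is not opened again
        return WIKILINK_RE.findall(self.text)
```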

regex improvements (tags)

The current tags_regex is not parsing nested tags for me.
The proposed regex will also parse all tags from the "sussudio.md" example without needing to modify the raw text beforehand.
It would also be a good improvement to move all the regexes into a constants.py file and load them from there.

Current regex:
tags_regex=r'(?<!\()#{1}([A-z]+[0-9_\-]*[A-Z0-9]?)\/?'

Proposed Regex:
tags_regex=r'(?<!\()(?<!\\)#{1}([A-z]+[0-9_\-]*[A-Z0-9]?[^\s]+(?![^\[\[]*\]\]))\/?'
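
To show what nested-tag matching looks like in isolation, here is a small sketch with a simpler pattern than either regex above. `TAG_RE` is a hypothetical pattern written for this example: it treats a tag as `#` followed by one or more `/`-separated parts, and skips a `#` that follows a word character, an opening parenthesis or a backslash. It does not attempt the wikilink exclusion of the proposed regex.

```python
import re

# Hypothetical pattern: '#', a leading letter or underscore, then optional
# '/'-separated nested parts
TAG_RE = re.compile(r'(?<![\w(\\])#([A-Za-z_][\w-]*(?:/[\w-]+)*)')


def find_tags(text):
    """Return tag names (without '#'), keeping nested paths intact."""
    return TAG_RE.findall(text)
```

For example, `find_tags('#parent/child and #solo')` keeps `parent/child` as one tag instead of cutting it at the slash.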

UnicodeDecodeError doesn't provide file details

While using vault = otools.Vault(obsidian_vault).connect().gather() to connect to my vault, it runs then gives a UnicodeDecodeError. I like how it provides errors about YAML header problems, but this one is hard to debug.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 90: invalid start byte

There isn't enough information in the error to know where the file loading failed. Is there a DEBUG flag or way to include the filename in error?
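
Until the library reports the file itself, a caller can locate the offending note by reading each file with a wrapper that names it on failure. This is a sketch using only the standard library; `read_note_text` is a hypothetical helper.

```python
from pathlib import Path


def read_note_text(filepath):
    """Read a note as UTF-8, naming the file if decoding fails."""
    try:
        return Path(filepath).read_text(encoding='utf-8')
    except UnicodeDecodeError as exc:
        # Chain the original error so the byte and position details survive
        raise ValueError(f'Could not decode {filepath}: {exc}') from exc
```

Running it over every `.md` file in the vault should surface the first file that is not valid UTF-8.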

Vault initialization fails when KB path is string

When I initialize the vault with a string path, it fails:

import obsidiantools.api as otools
import os

path = os.path.expanduser("~/kb")
vault = otools.Vault(path).connect()

gives

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/4t/bqbt3lbs0x7fg8gb6bzvthq00000gn/T/ipykernel_30638/336964736.py in <module>
      3 
      4 path = os.path.expanduser("~/kb")
----> 5 vault = otools.Vault(path).connect()

~/repos/obsidiantools/obsidiantools/api.py in __init__(self, dirpath)
     69         """
     70         self._dirpath = dirpath
---> 71         self._file_index = self._get_md_relpaths_by_name()
     72 
     73         # graph setup

~/repos/obsidiantools/obsidiantools/api.py in _get_md_relpaths_by_name(self)
    323             dict
    324         """
--> 325         return {f.stem: f for f in self._get_md_relpaths()}
    326 
    327     def _get_wikilinks_index(self):

~/repos/obsidiantools/obsidiantools/api.py in _get_md_relpaths(self)
    313             list
    314         """
--> 315         return get_md_relpaths_from_dir(self._dirpath)
    316 
    317     def _get_md_relpaths_by_name(self):

~/repos/obsidiantools/obsidiantools/md_utils.py in get_md_relpaths_from_dir(dir_path)
     23     """
     24     return [Path(p).relative_to(dir_path)
---> 25             for p in glob(str(dir_path / '**/*.md'), recursive=True)]
     26 
     27 

TypeError: unsupported operand type(s) for /: 'str' and 'str'

I would expect the initializer to either work with a string path, or explicitly declare the required type in the function as pathlib.Path, and emphasize that in the documentation.

I'm on macOS, Python 3.9.1, obsidiantools 5c86662.

Parse dataview metadata

Many Obsidian users (according to downloads) install the Obsidian Dataview plugin.
It has a different metadata syntax, which you can check in the documentation: https://blacksmithgu.github.io/obsidian-dataview/annotation/add-metadata/.

Also, we could reimplement the TypeScript code in Python: https://github.com/blacksmithgu/obsidian-dataview/blob/master/src/data-import/markdown-file.ts

Tags in code blocks are taken

As a placeholder.
I think code blocks should be ignored for tags? What do you think?

 "file_tags": [
        "meta",
        "idea",
        "shower-thought",
        "to-digest",
        "shower-thought",
        "introduction",
        "shower-thought\"",
        "guru\"",
        "shroedinger-uncertain\"",
        "floating-point-error\"",
        "socratic\""
      ]
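
One way to get that behaviour is to strip code from the raw text before looking for tags. The sketch below assumes fenced blocks are delimited by three backticks and inline code by single backticks; `strip_code` is a hypothetical helper and not the library's implementation.

```python
import re

FENCE = '`' * 3  # built this way to avoid a literal fence inside this example
FENCED_CODE_RE = re.compile(
    re.escape(FENCE) + r'.*?' + re.escape(FENCE), re.DOTALL)
INLINE_CODE_RE = re.compile(r'`[^`\n]+`')


def strip_code(markdown_text):
    """Remove fenced code blocks first, then inline code spans."""
    without_fences = FENCED_CODE_RE.sub('', markdown_text)
    return INLINE_CODE_RE.sub('', without_fences)
```

Tag extraction would then run on the output of `strip_code`, so strings such as `"#guru"` inside a JSON snippet no longer show up as tags.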

UnicodeDecodeError when connecting to Obsidian Vault

Hi, I'm testing out the package and I'm getting an error when I try to connect via Jupyter notebook in Windows 10: vault = otools.Vault(vault_dir).connect().gather().

I'm receiving the following UnicodeDecodeError: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1400: character maps to <undefined>.

Assuming the filepath and connection are working I'm unclear whether it is a problem I can correct in Obsidian or if it is a problem with the parser used by obsidiantools. Can you advise how I can resolve the issue?

For reference the stack trace is as follows:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19592/2568229718.py in <module>
----> 1 vault = otools.Vault(vault_dir).connect().gather()
      2 print(f"Connected?: {vault.is_connected}")
      3 print(f"Gathered?:  {vault.is_gathered}")

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\api.py in connect(self)
    199         if not self._is_connected:
    200             # default graph to mirror Obsidian's link counts
--> 201             wiki_link_map = self._get_wikilinks_index()
    202             G = nx.MultiDiGraph(wiki_link_map)
    203             self._graph = G

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\api.py in _get_wikilinks_index(self)
    438         where k is the md filename
    439         and v is list of ALL wikilinks found in k"""
--> 440         return {k: get_wikilinks(self._dirpath / v)
    441                 for k, v in self._file_index.items()}
    442 

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\api.py in <dictcomp>(.0)
    438         where k is the md filename
    439         and v is list of ALL wikilinks found in k"""
--> 440         return {k: get_wikilinks(self._dirpath / v)
    441                 for k, v in self._file_index.items()}
    442 

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in get_wikilinks(filepath)
     92         list of strings
     93     """
---> 94     plaintext = _get_ascii_plaintext_from_md_file(filepath, remove_code=True)
     95 
     96     wikilinks = _get_all_wikilinks_from_html_content(

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in _get_ascii_plaintext_from_md_file(filepath, remove_code)
    265     """md file -> html -> ASCII plaintext"""
    266     # strip out front matter (if any):
--> 267     html = _get_html_from_md_file(filepath)
    268     if remove_code:
    269         html = _remove_code(html)

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in _get_html_from_md_file(filepath)
    251 def _get_html_from_md_file(filepath):
    252     """md file -> html (without front matter)"""
--> 253     _, content = _get_md_front_matter_and_content(filepath)
    254     return markdown.markdown(content, output_format='html')
    255 

~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in _get_md_front_matter_and_content(filepath)
    242     with open(filepath) as f:
    243         try:
--> 244             front_matter, content = frontmatter.parse(f.read())
    245         except yaml.scanner.ScannerError:
    246             # for invalid YAML, return the whole file as content:

~\anaconda3\envs\Obsidian_Tools\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1400: character maps to <undefined>
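
The trace shows the file being opened without an explicit encoding and then decoded by cp1252, the locale default on this Windows setup. Byte 0x9d is undefined in cp1252 but occurs in the UTF-8 encoding of a right double quotation mark (U+201D). The sketch below reproduces that on any platform by forcing the codec, and shows that passing `encoding='utf-8'` avoids it:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / 'note.md'
    # U+201D encodes to the bytes E2 80 9D in UTF-8
    note.write_text('A \u201cquoted\u201d phrase', encoding='utf-8')

    # An explicit encoding decodes the same way on every platform
    text = note.read_text(encoding='utf-8')

    # Forcing cp1252 reproduces the reported failure
    try:
        note.read_text(encoding='cp1252')
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
```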

Frontmatter parsing error handling not robust

Frontmatter parsing can throw different exceptions than yaml.scanner.ScannerError. For example, I was testing this library on random Obsidian vaults I found on Github. It throws a ConstructorError on this file https://github.com/valentine195/Obsidian-Vault/blob/89abc098287aa7df0b2735cb229d15897d28c40d/7.%20Assets/Templates/group.md?plain=1#L1-L6 because it tries to parse {{groupTag}} as a YAML mapping. (It seems to be intended as some template for an Obsidian plugin?)

Anyway, since you're just eating the exception it's probably better to just catch any exception.
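
A minimal sketch of that suggestion, written so it runs without any third-party package: `parse` stands in for a function such as `frontmatter.parse` that returns a `(metadata, content)` pair, and `parse_or_fallback` is a hypothetical wrapper name.

```python
import warnings


def parse_or_fallback(text, parse, filepath='<unknown>'):
    """Call parse(text); on any failure return ({}, text) with a warning."""
    try:
        return parse(text)
    except Exception as exc:  # deliberately broad, as suggested above
        # Naming the file keeps the swallowed error traceable
        warnings.warn(f'Front matter in {filepath} not parsed: {exc!r}')
        return {}, text
```

Emitting a warning rather than staying silent keeps the broad `except` from hiding which note has invalid front matter.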

ValueError: Length of values (...) does not match length of index (...)

Thank you Mark for making Obsidian more accessible to Python users!! :-)

I was giving it a try with 40.076 files (incl. attachments). (Most of the MD files are generated and do not yet contain a lot of links and metadata.)

The method "gather" ran successfully in about 3 minutes! :-)

However, df = vault.get_all_file_metadata() raised an error.
Not sure if the following is of help to locate an issue.

ValueError Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 df = vault.get_all_file_metadata()

File C:...\obsidiantools\api.py:1345, in Vault.get_all_file_metadata(self)
1343 warnings.warn('Only notes (md files) were used to build the graph. Set attachments=True in the connect method to show all file metadata.')
1344 else:
-> 1345 df_media = self.get_media_file_metadata()
1346 df_media['graph_category'] = np.where(
1347 df_media['file_exists'], 'attachment', 'nonexistent')
1348 df_canvas = self.get_canvas_file_metadata()

File C:...\obsidiantools\api.py:1234, in Vault.get_media_file_metadata(self)
1232 return df
1233 else:
-> 1234 df = df.pipe(self._create_media_file_metadata_columns)
1235 return df

File C:...\pandas\core\generic.py:5512, in NDFrame.pipe(self, func, *args, **kwargs)
5454 @Final
5455 @doc(klass=_shared_doc_kwargs["klass"])
5456 def pipe(
(...)
5460 **kwargs,
5461 ) -> T:
5462 r"""
5463 Apply chainable functions that expect Series or DataFrames.
5464
(...)
5510 ... ) # doctest: +SKIP
5511 """
-> 5512 return com.pipe(self, func, *args, **kwargs)

File C:...\pandas\core\common.py:497, in pipe(obj, func, *args, **kwargs)
495 return func(*args, **kwargs)
496 else:
--> 497 return func(obj, *args, **kwargs)

File C:...\obsidiantools\api.py:1249, in Vault._create_media_file_metadata_columns(self, df)
1242 df['abs_filepath'] = np.where(df['rel_filepath'].notna(),
1243 [self._dirpath / str(f)
1244 for f in df['rel_filepath'].tolist()],
1245 np.NaN)
1246 df['file_exists'] = pd.Series(
1247 np.logical_not(df.index.isin(self._nonexistent_media_files)),
1248 index=df.index)
-> 1249 df['n_backlinks'] = self._get_backlink_counts_for_media_files_only()
1250 df['modified_time'] = pd.to_datetime(
1251 [f.lstat().st_mtime if not pd.isna(f)
1252 else pd.NaT
1253 for f in df['abs_filepath'].tolist()],
1254 unit='s')
1255 return df

File C:...\pandas\core\frame.py:3655, in DataFrame.__setitem__(self, key, value)
3652 self._setitem_array([key], value)
3653 else:
3654 # set column
-> 3655 self._set_item(key, value)

File C:...\pandas\core\frame.py:3832, in DataFrame._set_item(self, key, value)
3822 def _set_item(self, key, value) -> None:
3823 """
3824 Add series to DataFrame in specified column.
3825
(...)
3830 ensure homogeneity.
3831 """
-> 3832 value = self._sanitize_column(value)
3834 if (
3835 key in self.columns
3836 and value.ndim == 1
3837 and not is_extension_array_dtype(value)
3838 ):
3839 # broadcast across multiple columns if necessary
3840 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File C:...\pandas\core\frame.py:4538, in DataFrame._sanitize_column(self, value)
4535 return _reindex_for_setitem(value, self.index)
4537 if is_list_like(value):
-> 4538 com.require_length_match(value, self.index)
4539 return sanitize_array(value, self.index, copy=True, allow_2d=True)

File C:...\pandas\core\common.py:557, in require_length_match(data, index)
553 """
554 Check the length of data matches the length of the index.
555 """
556 if len(data) != len(index):
--> 557 raise ValueError(
558 "Length of values "
559 f"({len(data)}) "
560 "does not match length of index "
561 f"({len(index)})"
562 )

ValueError: Length of values (38135) does not match length of index (4216)

Handling of Self-Referential Links like TOC

While doing a graph walk, it seems the TOC notation causes trouble, as the result is notes that link to nothing.

There could be two outcomes:

A note that refers to itself using [[#...|]] notation may not result in a valid edge.
A note that refers to itself generates as many self-referential edges as there are entries in the TOC.

There is also another consideration: a note can link to another note using the "[[#" syntax which is the same as used for TOC. Does that create an edge between the two notes?
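
To make the two outcomes concrete, here is a small sketch of how the target note of a wikilink could be resolved. It assumes a link of the form `Note#Heading|alias` points at `Note`, and a link that starts with `#` points at the note it appears in. Both function names are hypothetical, and whether self links should become edges is left as a flag because that is the open question.

```python
def link_target(wikilink, current_note):
    """Return the note a wikilink points to; '#...' resolves to current_note."""
    # Drop the alias first, then the heading or block reference
    target = wikilink.split('|', 1)[0].split('#', 1)[0].strip()
    return target or current_note


def outgoing_edges(wikilinks, current_note, keep_self_links=False):
    """Build (source, target) pairs, optionally dropping self links."""
    targets = [link_target(w, current_note) for w in wikilinks]
    return [(current_note, t) for t in targets
            if keep_self_links or t != current_note]
```

Under these assumptions, a link such as `Other#Section` does create an edge to `Other`, while TOC entries only add edges when `keep_self_links=True`.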

Reference to non-md files

It seems that obsidiantools does not track non-markdown files (pictures, etc.). As a consequence, vault.nonexistent_notes lists all references to such files. I suggest including non-markdown files in the graph as well.

And, on a related note, vault.nonexistent_notes wrongly includes notes that are referenced with extension. For example, a reference of the form [[note.md]] leads to note.md being listed as non-existent even if the file note.md exists.

Can't build an exe with pyinstaller

I am building a Python app with the obsidiantools library. Everything works fine when I run it with Python, but when I try to build an exe of the application I get this error:

I tried to run:

 pyinstaller app.py --onefile --paths=C:\Users\Alessandro\OneDrive\Desktop\programmazione\myapp\env\Lib\site-packages

Tags from frontmatter aren't read as tags

There's a frontmatter field called "tags" that contains an array of strings that Obsidian treats like tags created using a hashtag in the notes content. Currently obsidiantools doesn't include this information when building the tags_index. (See https://help.obsidian.md/Editing+and+formatting/Tags)

It would be great if this feature could be added.

Handle .md inside wikilinks to reflect Obsidian graph

[[Foo]] and [[Bar.md]] will be related to the notes 'Foo' and 'Bar' respectively in the knowledge graph.

Currently, wikilinks getters will extract the wikilinks as 'Foo' and 'Bar.md'.
The expected behaviour of getters to reflect Obsidian's behaviour is 'Foo' and 'Bar' respectively.
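
The expected behaviour amounts to dropping a trailing `.md` from the extracted name. A minimal sketch, with a hypothetical helper name and a case-sensitive match as a simplifying assumption:

```python
def normalise_wikilink_name(name):
    """Drop a trailing '.md' so 'Bar.md' and 'Bar' name the same note."""
    # Slicing is used instead of str.removesuffix to stay compatible with 3.8
    return name[:-3] if name.endswith('.md') else name
```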

error on malformed frontmatter

In case of malformed frontmatter in a document, an exception is raised and not handled.

The error can be handled in

obsidiantools/obsidiantools/md_utils.py

Line 259 in ddd7866
Adding the following solves the problem (although the specific error should be named):

except:
    print("problem with file ", filepath)

Markdown parser cannot run on v0.8.0 (for one environment)

On some environments I'm having issues with getting markdown.markdown to work at all. Including on markdown v3.3 and the latest v3.4. It is an issue with how extensions are registered.

Debugging output does not point to a specific extension having a problem.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 markdown.markdown('x', output_format='html',
      2                              extensions=['pymdownx.arithmatex',
      3                                          'pymdownx.mark',
      4                                          'pymdownx.tilde',
      5                                          'pymdownx.saneheaders',
      6                                          'footnotes',
      7                                          'md_mermaid',
      8                                          'sane_lists',
      9                                          'tables'],
     10                              extension_configs={'pymdownx.tilde':
     11                                                 {'subscript': False}})

File ~/miniconda3/envs/nlpkit/lib/python3.8/site-packages/markdown/core.py:386, in markdown(text, **kwargs)
    371 def markdown(text, **kwargs):
    372     """Convert a markdown string to HTML and return HTML as a unicode string.
    373 
    374     This is a shortcut function for `Markdown` class to cover the most
   (...)
    384 
    385     """
--> 386     md = Markdown(**kwargs)
    387     return md.convert(text)

File ~/miniconda3/envs/nlpkit/lib/python3.8/site-packages/markdown/core.py:96, in Markdown.__init__(self, **kwargs)
     94 self.references = {}
     95 self.htmlStash = util.HtmlStash()
---> 96 self.registerExtensions(extensions=kwargs.get('extensions', []),
     97                         configs=kwargs.get('extension_configs', {}))
     98 self.set_output_format(kwargs.get('output_format', 'xhtml'))
     99 self.reset()

File ~/miniconda3/envs/nlpkit/lib/python3.8/site-packages/markdown/core.py:125, in Markdown.registerExtensions(self, extensions, configs)
    123     ext = self.build_extension(ext, configs.get(ext, {}))
    124 if isinstance(ext, Extension):
--> 125     ext.extendMarkdown(self)
    126     logger.debug(
    127         'Successfully loaded extension "%s.%s".'
    128         % (ext.__class__.__module__, ext.__class__.__name__)
    129     )
    130 elif ext is not None:

TypeError: extendMarkdown() missing 1 required positional argument: 'md_globals'

Unable to filter index using Windows filepath with include_subdirs=[]

Hi,

I can successfully view my vault file index in Windows. If I then try to filter the list by subdirectory I can successfully list notes in the root and in the 'docs' folders. If I filter by the name of a lower subdirectory using a Windows filepath the returned list is empty.

For example, my file index includes the following list items:

{'README': WindowsPath('README.md'),
 'index': WindowsPath('docs/index.md'),
 'Quotations': WindowsPath('docs/Quotations.md'),
 'Creative Commons': WindowsPath('docs/Concepts/Creative Commons.md'),
 'Crowdsourcing': WindowsPath('docs/Concepts/Crowdsourcing.md'),
 'Data Format': WindowsPath('docs/Concepts/Data Format.md'),
 'Data Model': WindowsPath('docs/Concepts/Data Model.md'),
 'Data Sovereignty': WindowsPath('docs/Concepts/Data Sovereignty.md'),
}

Based on the obsidiantools-demo I would expect to be able to list all the markdown files in the 'Concepts' folder using the following call:

(otools.Vault(vault_dir, include_subdirs=['docs/Concepts'], include_root=False)
.file_index)

Instead the returned object is empty {}.

Reversing the slash to a backslash (the Windows path separator) resolves the issue:

(otools.Vault(vault_dir, include_subdirs=['docs\Concepts'], include_root=False)
 .file_index)

Returns:

{'Creative Commons': WindowsPath('docs/Concepts/Creative Commons.md'),
 'Crowdsourcing': WindowsPath('docs/Concepts/Crowdsourcing.md'),
 'Data Format': WindowsPath('docs/Concepts/Data Format.md'),
 'Data Model': WindowsPath('docs/Concepts/Data Model.md'),
 'Data Sovereignty': WindowsPath('docs/Concepts/Data Sovereignty.md')}

Ideally this would be resolved by the obsidiantools package rather than the user. Alternatively suggest updating the documentation.
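
One possible way to resolve this in code is to normalise both sides to forward slashes before comparing. The sketch below relies on `PureWindowsPath` accepting either separator; `normalise_subdir` and `filter_by_subdirs` are hypothetical helpers, not the package's API.

```python
from pathlib import PureWindowsPath


def normalise_subdir(subdir):
    """Return subdir with forward slashes, whichever separator was given."""
    # PureWindowsPath accepts both '/' and '\\' as separators
    return PureWindowsPath(str(subdir)).as_posix()


def filter_by_subdirs(file_index, include_subdirs):
    """Keep the notes whose parent folder is one of include_subdirs."""
    wanted = {normalise_subdir(s) for s in include_subdirs}
    return {name: path for name, path in file_index.items()
            if normalise_subdir(path.parent) in wanted}
```

With this normalisation, `'docs/Concepts'` and `'docs\\Concepts'` select the same notes.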

Fails to parse a note that only has frontmatter

I have a note with the following contents in my KB:

---
aliases: [Product-Market fit]
---

Parsing it fails with the following trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/4t/bqbt3lbs0x7fg8gb6bzvthq00000gn/T/ipykernel_30638/159641224.py in <module>
      8 path = pathlib.Path(os.path.expanduser("~/kb"))
      9 print(path)
---> 10 vault = otools.Vault(path).connect()

~/repos/obsidiantools/obsidiantools/api.py in connect(self)
    159         if not self._is_connected:
    160             # default graph to mirror Obsidian's link counts
--> 161             wiki_link_map = self._get_wikilinks_index()
    162             G = nx.MultiDiGraph(wiki_link_map)
    163             self._graph = G

~/repos/obsidiantools/obsidiantools/api.py in _get_wikilinks_index(self)
    329         where k is the md filename
    330         and v is list of ALL wikilinks found in k"""
--> 331         return {k: get_wikilinks(self._dirpath / v)
    332                 for k, v in self._file_index.items()}
    333 

~/repos/obsidiantools/obsidiantools/api.py in <dictcomp>(.0)
    329         where k is the md filename
    330         and v is list of ALL wikilinks found in k"""
--> 331         return {k: get_wikilinks(self._dirpath / v)
    332                 for k, v in self._file_index.items()}
    333 

~/repos/obsidiantools/obsidiantools/md_utils.py in get_wikilinks(filepath)
     46     """
     47     print(filepath)
---> 48     plaintext = _get_ascii_plaintext_from_md_file(filepath)
     49 
     50     wikilinks = _get_all_wikilinks_from_html_content(

~/repos/obsidiantools/obsidiantools/md_utils.py in _get_ascii_plaintext_from_md_file(filepath)
    191     html = _get_html_from_md_file(filepath)
    192     # strip out front matter (if any):
--> 193     html = _remove_front_matter(html)
    194     return _get_ascii_plaintext_from_html(html)
    195 

~/repos/obsidiantools/obsidiantools/md_utils.py in _remove_front_matter(html)
    202     if hr_content:
    203         # wipe out content from first hr (the front matter)
--> 204         for fm_detail in hr_content.find_next('p'):
    205             fm_detail.extract()
    206         # then wipe all hr elements

TypeError: 'NoneType' object is not iterable

I'm on macOS, Python 3.9.1, obsidiantools 5c86662.

TypeError: unsupported operand type(s) for /: 'str' and 'str'

Dunno what in my notes could possibly be triggering this error. But it happens whenever I try and use the get_note_metadata() method on my vault.

Text goes missing even though the HTML is OK (html2text parsing issues)

For one of my notes with a mix of tables, LaTeX, lists & code blocks, there is a lot of text from the note that isn't captured in source_text_index, but is kept in the HTML. This suggests some parsing issues with how html2text is configured.

Whole paragraph blocks & headers can be completely missing.

This starts to happen after a table with LaTeX. Anything in body text (<p>) afterwards is missing, yet it keeps all the remaining LaTeX (even the stuff in tables).

Perhaps it doesn't like MathJax? Maybe wiping out a few tags from HTML, for the source_text functionality, before it gets processed by html2text could make the output smoother in this case.

Need to think more about:

- What HTML tags are not necessary for source_text?
  - LaTeX is one aspect to remove if causing problems. Keep as much as possible for html2text to handle (including URLs, images, etc.). Anything more opinionated (e.g. do we want strikethrough text or not) would be better covered in readable_text.
  - May involve another Markdown class if switching off markdown extensions, more functions to do this specific HTML generation, etc.
- A test case from a reduced version of my note