Comments (4)
The issue you're encountering with the SentenceWindowNodeParser and the max_length parameter in the HuggingFaceEmbedding model is likely due to the way the SentenceWindowNodeParser handles large passages. The SentenceWindowNodeParser splits the document into nodes, each containing a window of surrounding sentences based on the window_size parameter. However, it excludes the window metadata and original text metadata from the embedding and LLM processes to avoid exceeding the max_length parameter.
Here's a breakdown of the relevant configurations and considerations:
- Window Size: Setting a large window_size (e.g., 14) can result in very large chunks, which might exceed the max_length parameter of the embedding model.
- Excluding Metadata: The SentenceWindowNodeParser excludes the window metadata and original text metadata from embedding and LLM processes to manage the token count effectively.
- Max Length: The max_length parameter in the HuggingFaceEmbedding model specifies the maximum length of the input. If this length is exceeded, it can cause errors.
Here is the relevant code snippet from the SentenceWindowNodeParser:
    # exclude window metadata from embed and llm
    node.excluded_embed_metadata_keys.extend(
        [self.window_metadata_key, self.original_text_metadata_key]
    )
    node.excluded_llm_metadata_keys.extend(
        [self.window_metadata_key, self.original_text_metadata_key]
    )
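To make the effect of the window and the exclusions concrete, here is a minimal, library-free sketch. It is not the llama_index implementation: `build_sentence_windows` and the dictionary layout are hypothetical names chosen for illustration. It only shows that the text to be embedded stays a single sentence, while the window stored in metadata grows with window_size.

```python
from typing import Dict, List


def build_sentence_windows(sentences: List[str], window_size: int) -> List[Dict[str, str]]:
    """Pair each sentence with a window of its neighbours (illustrative only)."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                # The sentence itself: the only part that would be embedded.
                "text": sentence,
                # The surrounding window: kept as metadata, excluded from embedding.
                "window": " ".join(sentences[start:end]),
            }
        )
    return nodes


sentences = ["One.", "Two.", "Three.", "Four.", "Five."]
nodes = build_sentence_windows(sentences, window_size=1)
middle = nodes[2]
```

With a window_size larger than the document (for example 14 here), every window in this sketch becomes the full text, which is why the window length, not the embedded text, is what grows.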
To address your issue, ensure that the max_length parameter is appropriately set and that the SentenceWindowNodeParser is configured to handle large passages without exceeding this limit. You might need to adjust the window_size or further split the text to manage the token count effectively.
Additionally, consider the following performance optimizations:
- Batch Size: Adjust the embed_batch_size to optimize processing.
- Device Utilization: Ensure the model is using the appropriate device (e.g., GPU).
- Normalization: Decide whether to normalize embeddings based on your use case.
- Pooling Strategy: Choose an appropriate pooling strategy (for example, CLS or mean pooling) for combining token embeddings into a single vector.
By carefully configuring these parameters, you can manage large passages and avoid exceeding the max_length parameter [1][2].
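As a rough illustration of what the pooling and normalization options refer to, the sketch below applies mean pooling and L2 normalization to toy token vectors in plain Python. The helper names `mean_pool` and `l2_normalize` and the toy numbers are assumptions for illustration; this is not how HuggingFaceEmbedding is implemented internally.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average the per-token vectors into one vector of the same width."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


# Three toy "token embeddings" with two dimensions each.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(token_vectors)
unit = l2_normalize(pooled)
```

Normalized vectors have length 1, so a dot product between two of them equals their cosine similarity, which is one reason to enable normalization for similarity search.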
To continue talking to Dosu, mention @dosu.
from llama_index.
Yes, you are correct. The max_length parameter in the HuggingFaceEmbedding model only controls the maximum length of the input for the embedding model itself. It does not directly limit the length of the sentence window nodes created by the SentenceWindowNodeParser.
The SentenceWindowNodeParser does not enforce a maximum token length on the nodes it generates. It simply concatenates the windowed sentences and stores them in the metadata without checking the length against any maximum limit [1].
To ensure that the nodes do not exceed the max_length parameter, you would need to implement additional logic to split or truncate the nodes accordingly before passing them to the embedding model. This could involve adjusting the window_size or adding a step to check and enforce the token length of each node.
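One way such a length check could look is sketched below. It is a standalone illustration under stated assumptions, not part of llama_index: `build_budgeted_window` is a hypothetical helper, and `count_tokens` counts whitespace-separated words as a stand-in for a real tokenizer. The window grows outward from the center sentence, so the center sentence is always kept and neighbours are added only while the token budget allows.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace words stand in for real tokens."""
    return len(text.split())


def build_budgeted_window(
    sentences: List[str],
    center: int,
    window_size: int,
    max_tokens: int,
    token_counter: Callable[[str], int] = count_tokens,
) -> str:
    """Grow a contiguous window around `center` without exceeding `max_tokens`."""
    left = right = center
    used = token_counter(sentences[center])  # the center sentence is always kept
    grow_left = grow_right = True
    for _ in range(window_size):  # at most window_size sentences per side
        if grow_left and left - 1 >= 0:
            cost = token_counter(sentences[left - 1])
            if used + cost <= max_tokens:
                left -= 1
                used += cost
            else:
                grow_left = False  # stop on this side to keep the window contiguous
        if grow_right and right + 1 < len(sentences):
            cost = token_counter(sentences[right + 1])
            if used + cost <= max_tokens:
                right += 1
                used += cost
            else:
                grow_right = False
    return " ".join(sentences[left : right + 1])


sentences = ["a b c", "d e", "f", "g h", "i j k l"]
window = build_budgeted_window(sentences, center=2, window_size=2, max_tokens=5)
```

In practice the token counter would be swapped for the tokenizer of the model that finally consumes the window, since word counts only approximate real token counts.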
from llama_index.
@dosu Did I misunderstand, or is the value of "max_length" only related to the embed model's max length and not to the sentence window's max length? That would explain why the chunks are still 2000+ tokens: the setting only applies to the embed model, not to the length of the sentence window node.
from llama_index.
Unfortunately I really don't know how GitHub works, but I implemented a solution for my problem. I guess this would be a pull request? The size of the sentence window now also depends on the configured max token size. To make the limit irrelevant if you only want the plain sentence window, we could set a blatantly high token count so that it gets ignored. But so far I like this very much.
Just for the matter of convention, how would it work if I'd suggest a code change on GitHub @logan-markewich? I'm really very new to this lol.
This is the new implementation of sentence_window.py:
"""Simple node parser."""

from typing import Any, Callable, List, Optional, Sequence

from transformers import AutoTokenizer

from llama_index.core.bridge.pydantic import Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.node_parser.node_utils import (
    build_nodes_from_splits,
    default_id_func,
)
from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer
from llama_index.core.schema import BaseNode, Document
from llama_index.core.utils import get_tqdm_iterable

DEFAULT_WINDOW_SIZE = 3
DEFAULT_WINDOW_METADATA_KEY = "window"
DEFAULT_OG_TEXT_METADATA_KEY = "original_text"
DEFAULT_WINDOW_TOKEN_SIZE = 2000


class SentenceWindowNodeParser(NodeParser):
    """Sentence window node parser.

    Splits a document into Nodes, with each node being a sentence.
    Each node contains a window from the surrounding sentences in the metadata.

    Args:
        sentence_splitter (Optional[Callable]): splits text into sentences
        include_metadata (bool): whether to include metadata in nodes
        include_prev_next_rel (bool): whether to include prev/next relationships
    """

    sentence_splitter: Callable[[str], List[str]] = Field(
        default_factory=split_by_sentence_tokenizer,
        description="The text splitter to use when splitting documents.",
        exclude=True,
    )
    window_size: int = Field(
        default=DEFAULT_WINDOW_SIZE,
        description="The number of sentences on each side of a sentence to capture.",
        gt=0,
    )
    window_metadata_key: str = Field(
        default=DEFAULT_WINDOW_METADATA_KEY,
        description="The metadata key to store the sentence window under.",
    )
    original_text_metadata_key: str = Field(
        default=DEFAULT_OG_TEXT_METADATA_KEY,
        description="The metadata key to store the original sentence in.",
    )
    tokenizer: AutoTokenizer = Field(
        default_factory=lambda: AutoTokenizer.from_pretrained("mlabonne/NeuralDaredevil-8B-abliterated"),
        description="The tokenizer to use for counting tokens.",
        exclude=True,
    )
    window_token_size: int = Field(
        default=DEFAULT_WINDOW_TOKEN_SIZE,
        description="The maximum token size for the window.",
        gt=0,
    )

    @classmethod
    def class_name(cls) -> str:
        return "SentenceWindowNodeParser"

    @classmethod
    def from_defaults(
        cls,
        sentence_splitter: Optional[Callable[[str], List[str]]] = None,
        window_size: int = DEFAULT_WINDOW_SIZE,
        window_token_size: int = DEFAULT_WINDOW_TOKEN_SIZE,
        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,
        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
        callback_manager: Optional[CallbackManager] = None,
        id_func: Optional[Callable[[int, Document], str]] = None,
    ) -> "SentenceWindowNodeParser":
        callback_manager = callback_manager or CallbackManager([])
        sentence_splitter = sentence_splitter or split_by_sentence_tokenizer()
        id_func = id_func or default_id_func

        return cls(
            sentence_splitter=sentence_splitter,
            window_size=window_size,
            window_token_size=window_token_size,
            window_metadata_key=window_metadata_key,
            original_text_metadata_key=original_text_metadata_key,
            include_metadata=include_metadata,
            include_prev_next_rel=include_prev_next_rel,
            callback_manager=callback_manager,
            id_func=id_func,
        )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        """Parse document into nodes."""
        all_nodes: List[BaseNode] = []
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")

        for node in nodes_with_progress:
            nodes = self.build_window_nodes_from_documents([node])
            all_nodes.extend(nodes)

        return all_nodes

    def build_window_nodes_from_documents(
        self, documents: Sequence[Document]
    ) -> List[BaseNode]:
        """Build window nodes from documents."""
        all_nodes: List[BaseNode] = []
        for doc in documents:
            text = doc.text
            text_splits = self.sentence_splitter(text)
            nodes = build_nodes_from_splits(
                text_splits,
                doc,
                id_func=self.id_func,
            )

            # Add window to each node
            for i, node in enumerate(nodes):
                window_nodes = []
                window_token_count = 0
                for j in range(max(0, i - self.window_size), min(i + self.window_size + 1, len(nodes))):
                    window_node = nodes[j]
                    window_node_tokens = self.tokenizer(window_node.text, return_tensors="pt")
                    window_token_count += window_node_tokens.input_ids.size(-1)
                    if window_token_count > self.window_token_size:
                        break
                    window_nodes.append(window_node)

                node.metadata[self.window_metadata_key] = " ".join([n.text for n in window_nodes])
                node.metadata[self.original_text_metadata_key] = node.text

                # Exclude window metadata from embed and LLM
                node.excluded_embed_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )
                node.excluded_llm_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )

            all_nodes.extend(nodes)

        return all_nodes
from llama_index.