
Comments (4)

dosubot commented on June 26, 2024

The issue you're encountering with the SentenceWindowNodeParser and the max_length parameter in the HuggingFaceEmbedding model is likely due to the way the SentenceWindowNodeParser handles large passages. The SentenceWindowNodeParser splits the document into nodes, each containing a window of surrounding sentences based on the window_size parameter. However, it excludes the window metadata and original text metadata from the embedding and LLM processes to avoid exceeding the max_length parameter.

Here's a breakdown of the relevant configurations and considerations:

  1. Window Size: Setting a large window_size (e.g., 14) can result in very large chunks, which might exceed the max_length parameter of the embedding model.

  2. Excluding Metadata: The SentenceWindowNodeParser excludes the window metadata and original text metadata from embedding and LLM processes to manage the token count effectively.

  3. Max Length: The max_length parameter in the HuggingFaceEmbedding model specifies the maximum token length of the input to the embedding model; inputs longer than this can cause errors.

Here is the relevant code snippet from the SentenceWindowNodeParser:

# exclude window metadata from embed and llm
node.excluded_embed_metadata_keys.extend(
    [self.window_metadata_key, self.original_text_metadata_key]
)
node.excluded_llm_metadata_keys.extend(
    [self.window_metadata_key, self.original_text_metadata_key]
)

To address your issue, ensure that the max_length parameter is appropriately set and that the SentenceWindowNodeParser is configured to handle large passages without exceeding this limit. You might need to adjust the window_size or further split the text to manage the token count effectively.
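For illustration, a minimal configuration sketch (the window size and metadata keys below are just examples, not recommendations):

from llama_index.core.node_parser import SentenceWindowNodeParser

# A smaller window keeps the concatenated "window" metadata short enough
# to stay within the embedding model's input budget.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # sentences captured on each side of each sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)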

Additionally, consider the following performance optimizations:

  • Batch Size: Adjust the embed_batch_size to optimize processing.
  • Device Utilization: Ensure the model is using the appropriate device (e.g., GPU).
  • Normalization: Decide whether to normalize embeddings based on your use case.
  • Pooling Strategy: Choose an appropriate pooling strategy (e.g., mean or CLS pooling) to collapse per-token embeddings into a single vector.

By carefully configuring these parameters, you can manage large passages and avoid exceeding the max_length parameter [1][2].
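As a rough sketch of how those knobs map onto the HuggingFaceEmbedding constructor (the model name is a placeholder, and available parameters vary by llama_index version):

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # placeholder model
    max_length=512,       # truncation length for embedding inputs
    embed_batch_size=32,  # larger batches amortize per-call overhead
    device="cuda",        # or "cpu" if no GPU is available
    normalize=True,       # unit-length vectors for cosine similarity
)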

To continue talking to Dosu, mention @dosu.


dosubot commented on June 26, 2024

Yes, you are correct. The max_length parameter in the HuggingFaceEmbedding model only controls the maximum length of the input for the embedding model itself. It does not directly limit the length of the sentence window nodes created by the SentenceWindowNodeParser.

The SentenceWindowNodeParser does not enforce a maximum token length on the nodes it generates. It simply concatenates the windowed sentences and stores them in the metadata without checking the length against any maximum limit [1].

To ensure that the nodes do not exceed the max_length parameter, you would need to implement additional logic to split or truncate the nodes accordingly before passing them to the embedding model. This could involve adjusting the window_size or adding a step to check and enforce the token length of each node.
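One hypothetical post-processing step (the tokenizer and limit below are assumptions; match them to your embedding model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder
MAX_TOKENS = 512  # should match the embedding model's max_length

for node in nodes:
    window = node.metadata.get("window", "")
    token_ids = tokenizer(window, truncation=False)["input_ids"]
    if len(token_ids) > MAX_TOKENS:
        # Truncate the window text to the token budget; splitting the node
        # into several smaller nodes would be the other option.
        node.metadata["window"] = tokenizer.decode(
            token_ids[:MAX_TOKENS], skip_special_tokens=True
        )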

To continue talking to Dosu, mention @dosu.


Bearsaerker commented on June 26, 2024

@dosu Did I misunderstand, or is the value of "max_length" only related to the embed model's max length and not to the sentence window's max length? That would explain why the chunks are still 2000+ tokens: the setting only applies to the embed model, not to the length of the sentence window node.


Bearsaerker commented on June 26, 2024

Unfortunately I really don't know how GitHub works, but I implemented a solution for my problem. I guess this would be a pull request? The size of the sentence window now also depends on the configured max token size. To make the cap irrelevant if you only want to use the sentence window as before, you could set a blatantly high token count so that it is effectively ignored. But so far I like this very much.
Just as a matter of convention, how would I suggest a code change on GitHub @logan-markewich? I'm really very new to this lol.
This is the new implementation of the sentence_window.py:

"""Simple node parser."""

from typing import Any, Callable, List, Optional, Sequence

from transformers import AutoTokenizer
from llama_index.core.bridge.pydantic import Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.node_parser.node_utils import (
    build_nodes_from_splits,
    default_id_func,
)
from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer
from llama_index.core.schema import BaseNode, Document
from llama_index.core.utils import get_tqdm_iterable

DEFAULT_WINDOW_SIZE = 3
DEFAULT_WINDOW_METADATA_KEY = "window"
DEFAULT_OG_TEXT_METADATA_KEY = "original_text"
DEFAULT_WINDOW_TOKEN_SIZE = 2000

class SentenceWindowNodeParser(NodeParser):
    """Sentence window node parser.

    Splits a document into Nodes, with each node being a sentence.
    Each node contains a window from the surrounding sentences in the metadata.

    Args:
        sentence_splitter (Optional[Callable]): splits text into sentences
        window_size (int): number of sentences on each side of a sentence to capture
        window_token_size (int): maximum token count for the concatenated window
        include_metadata (bool): whether to include metadata in nodes
        include_prev_next_rel (bool): whether to include prev/next relationships
    """

    sentence_splitter: Callable[[str], List[str]] = Field(
        default_factory=split_by_sentence_tokenizer,
        description="The text splitter to use when splitting documents.",
        exclude=True,
    )
    window_size: int = Field(
        default=DEFAULT_WINDOW_SIZE,
        description="The number of sentences on each side of a sentence to capture.",
        gt=0,
    )
    window_metadata_key: str = Field(
        default=DEFAULT_WINDOW_METADATA_KEY,
        description="The metadata key to store the sentence window under.",
    )
    original_text_metadata_key: str = Field(
        default=DEFAULT_OG_TEXT_METADATA_KEY,
        description="The metadata key to store the original sentence in.",
    )
    tokenizer: AutoTokenizer = Field(
        default_factory=lambda: AutoTokenizer.from_pretrained(
            "mlabonne/NeuralDaredevil-8B-abliterated"
        ),
        description="The tokenizer to use for counting tokens.",
        exclude=True,
    )
    window_token_size: int = Field(
        default=DEFAULT_WINDOW_TOKEN_SIZE,
        description="The maximum token size for the window.",
        gt=0,
    )

    @classmethod
    def class_name(cls) -> str:
        return "SentenceWindowNodeParser"

    @classmethod
    def from_defaults(
        cls,
        sentence_splitter: Optional[Callable[[str], List[str]]] = None,
        window_size: int = DEFAULT_WINDOW_SIZE,
        window_token_size: int = DEFAULT_WINDOW_TOKEN_SIZE,
        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,
        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
        callback_manager: Optional[CallbackManager] = None,
        id_func: Optional[Callable[[int, Document], str]] = None,
    ) -> "SentenceWindowNodeParser":
        callback_manager = callback_manager or CallbackManager([])
        sentence_splitter = sentence_splitter or split_by_sentence_tokenizer()
        id_func = id_func or default_id_func

        return cls(
            sentence_splitter=sentence_splitter,
            window_size=window_size,
            window_token_size=window_token_size,
            window_metadata_key=window_metadata_key,
            original_text_metadata_key=original_text_metadata_key,
            include_metadata=include_metadata,
            include_prev_next_rel=include_prev_next_rel,
            callback_manager=callback_manager,
            id_func=id_func,
        )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        """Parse document into nodes."""
        all_nodes: List[BaseNode] = []
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")

        for node in nodes_with_progress:
            nodes = self.build_window_nodes_from_documents([node])
            all_nodes.extend(nodes)

        return all_nodes

    def build_window_nodes_from_documents(
        self, documents: Sequence[Document]
    ) -> List[BaseNode]:
        """Build window nodes from documents."""
        all_nodes: List[BaseNode] = []
        for doc in documents:
            text = doc.text
            text_splits = self.sentence_splitter(text)
            nodes = build_nodes_from_splits(
                text_splits,
                doc,
                id_func=self.id_func,
            )

            # Add a windowed view to each node, capping the window at
            # window_token_size tokens.
            for i, node in enumerate(nodes):
                window_nodes = []
                window_token_count = 0
                for j in range(
                    max(0, i - self.window_size),
                    min(i + self.window_size + 1, len(nodes)),
                ):
                    window_node = nodes[j]
                    # Count this sentence's tokens; stop growing the window
                    # once the budget would be exceeded.
                    window_node_tokens = self.tokenizer(
                        window_node.text, return_tensors="pt"
                    )
                    window_token_count += window_node_tokens.input_ids.size(-1)
                    if window_token_count > self.window_token_size:
                        break
                    window_nodes.append(window_node)

                node.metadata[self.window_metadata_key] = " ".join(
                    [n.text for n in window_nodes]
                )
                node.metadata[self.original_text_metadata_key] = node.text

                # Exclude window metadata from embed and LLM
                node.excluded_embed_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )
                node.excluded_llm_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )

            all_nodes.extend(nodes)
        return all_nodes
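
With this change, usage might look like the following (the window and token sizes are arbitrary examples; the tokenizer defaults to the one declared in the field above):

parser = SentenceWindowNodeParser.from_defaults(
    window_size=14,
    window_token_size=1000,  # cap the window at the embed model's budget
)
nodes = parser.get_nodes_from_documents(documents)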

