Comments (12)

polm commented on June 12, 2024

Sorry you're having trouble with this.

Does the listed matrix.bin file exist? Have you tried re-installing?

mmap-related failures usually come from running out of memory after creating too many Tagger objects, but in this case it looks like the file may not exist. I have never heard of minato before, but since you mention it, maybe you're using it to cache the UniDic files or something?
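
One quick way to rule out a missing file is to check the dictionary directory directly. This is only a minimal sketch: `missing_dictionary_files` is a hypothetical helper, not part of fugashi, and the only file name taken from this thread is matrix.bin.

```python
from pathlib import Path
from typing import Iterable, List, Union


def missing_dictionary_files(
    dicdir: Union[str, Path],
    required: Iterable[str] = ("matrix.bin",),
) -> List[str]:
    """Return the names in `required` that are not regular files in `dicdir`."""
    base = Path(dicdir)
    return [name for name in required if not (base / name).is_file()]
```

Calling it with the directory you pass to -d (for example `missing_dictionary_files(unidic.DICDIR)`) should return an empty list when the file is present.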

from fugashi.

m-hammad-khan commented on June 12, 2024

Yes, the file exists. I am using the python -m unidic download command, and I have tried re-installing it multiple times, but no luck! And yes, sorry, you can ignore minato.

m-hammad-khan commented on June 12, 2024

Here is the code:

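from configparser import SectionProxy
from os import PathLike
from typing import Any, Dict, List, Optional, Union

import fugashi
import ipadic
import jaconv
import minato
import unidic

# Token, parse_feature_for_ipadic and parse_feature_for_unidic are used below
# but are not defined in this snippet.


class Tokenizer: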
    def __init__(
        self,
        system_dictionary_path: Optional[Union[str, PathLike]] = None,
        user_dictionary_path: Optional[Union[str, PathLike]] = None,
    ) -> None:
        if system_dictionary_path == "ipadic":
            system_dictionary_path = ipadic.DICDIR
        elif system_dictionary_path == "unidic":
            system_dictionary_path = unidic.DICDIR

        self._system_dictionary_path = system_dictionary_path or unidic.DICDIR
        self._user_dictionary_path = user_dictionary_path

        self._tagger: Optional[fugashi.Tagger] = None

    @classmethod
    def from_config(cls, config: SectionProxy) -> "Tokenizer":
        return Tokenizer(
            system_dictionary_path=config.get("system_dictionary_path"),
            user_dictionary_path=config.get("user_dictionary_path"),
        )

    @property
    def tagger(self) -> fugashi.Tagger:
        # setup tagger
        options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]
        if self._user_dictionary_path:
            options.append(f"-u {minato.cached_path(self._user_dictionary_path)}")
        if not self._tagger:
            self._tagger = fugashi.GenericTagger(" ".join(options))
        # setup token parser
        if "ipadic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_ipadic
        elif "unidic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_unidic
        else:
            raise ValueError("system_dictionary_path must contain 'ipadic' or 'unidic'")

        return self._tagger

    @staticmethod
    def normalize(text: str) -> str:
        text = jaconv.z2h(text, kana=False, ascii=True, digit=True)
        text = jaconv.h2z(text, kana=True, ascii=False, digit=False)
        text = text.replace("〜", "ー")
        return text

    def tokenize(self, text: str) -> List[Token]:
        return [self._parse_feature(token) for token in self.tagger(text)]

    def __getstate__(self) -> Dict[str, Any]:
        return {
            "system_dictionary_path": self._system_dictionary_path,
            "user_dictionary_path": self._user_dictionary_path,
        }

    def __setstate__(self, state: Dict[str, Any]) -> None:
        self._tagger = None
        self._system_dictionary_path = state["system_dictionary_path"]
        self._user_dictionary_path = state["user_dictionary_path"]

polm commented on June 12, 2024

What is an example of the actual code that causes the issue? You have provided a class definition but no code using it. Also, you said it was OK to ignore minato, but your example code uses minato to cache the dictionary path...

Does just using this code work?

import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))

m-hammad-khan commented on June 12, 2024

Yes, it works.

Code:

import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple

Output:

麩      麩      名詞,普通名詞,一般,*
菓子    菓子    名詞,普通名詞,一般,*
は      は      助詞,係助詞,*,*
、      、      補助記号,読点,*,*
麩      麩      名詞,普通名詞,一般,*
を      を      助詞,格助詞,*,*
主材    主材    名詞,普通名詞,一般,*
料      料      接尾辞,名詞的,一般,*
と      と      助詞,格助詞,*,*
し      為る    動詞,非自立可能,*,*
た      た      助動詞,*,*,*
日本    日本    名詞,固有名詞,地名,国
の      の      助詞,格助詞,*,*
菓子    菓子    名詞,普通名詞,一般,*
。      。      補助記号,句点,*,*

polm commented on June 12, 2024

OK, in that case it seems like something is wrong with your wrapper class, particularly this line:

options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]

m-hammad-khan commented on June 12, 2024

OK, let me try using unidic.DICDIR directly, without minato.

m-hammad-khan commented on June 12, 2024

No luck. The tokenizer actually works on most of the text, but after some time it fails with this error while trying to open the matrix.bin file. I'm not sure if it's a memory issue; I have 2.5 million strings to tokenize.

polm commented on June 12, 2024

The matrix.bin file is only accessed when the Tagger is first created, so it sounds like you're creating multiple taggers. Are you doing something like #35 where you're creating a Tagger inside a loop or something?

You typically don't need more than one Tagger in a whole process, or at most one per thread.
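
If you do use threads, one way to keep it to one Tagger per thread is thread-local storage. This is a rough sketch, not fugashi API: `get_thread_tagger` is a hypothetical helper, and `CountingTagger` is a stand-in that only exists so the example runs without a dictionary installed. In real code you would pass something like `fugashi.Tagger` as the factory.

```python
import threading
from typing import Any, Callable, List

_local = threading.local()


def get_thread_tagger(factory: Callable[[], Any]) -> Any:
    """Return this thread's tagger, building it with `factory` on first use."""
    tagger = getattr(_local, "tagger", None)
    if tagger is None:
        tagger = factory()
        _local.tagger = tagger
    return tagger


class CountingTagger:
    """Stand-in for a real tagger; counts how often it is constructed."""

    created = 0
    _lock = threading.Lock()

    def __init__(self) -> None:
        with CountingTagger._lock:
            CountingTagger.created += 1

    def __call__(self, text: str) -> List[str]:
        return text.split()


def worker(lines: List[str]) -> None:
    for line in lines:
        # The lookup is cheap; the tagger is built only once per thread.
        get_thread_tagger(CountingTagger)(line)


threads = [
    threading.Thread(target=worker, args=(["a b", "c d"] * 100,))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(CountingTagger.created)  # 2: one construction per worker thread
```

Each worker processes 200 lines here, but the tagger is only constructed once per thread.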

polm commented on June 12, 2024

Closing because this seems to be a usage issue and there's not enough information to debug it. If you can provide a reproducible example I will take a closer look.

m-hammad-khan commented on June 12, 2024

It is solved, thanks. I was creating multiple instances.

polm commented on June 12, 2024

Glad you figured it out. You need to be careful when creating multiple instances, as you can quickly run out of memory, which can cause mmap errors.
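
For anyone who ends up here with a wrapper class like the one above, one possible way to avoid this is to share a single tagger between all wrapper instances that use the same options. This is just a sketch under assumptions: `StubTagger` stands in for `fugashi.GenericTagger` so the example runs without MeCab, `shared_tagger` and `SharedTokenizer` are hypothetical names, and "/path/to/dic" is a placeholder path.

```python
from functools import lru_cache
from typing import List


class StubTagger:
    """Stand-in for a real tagger; counts how many instances are created."""

    instances = 0

    def __init__(self, options: str) -> None:
        StubTagger.instances += 1
        self.options = options

    def __call__(self, text: str) -> List[str]:
        return text.split()


@lru_cache(maxsize=None)
def shared_tagger(options: str) -> StubTagger:
    """Build at most one tagger per distinct option string in this process."""
    return StubTagger(options)


class SharedTokenizer:
    def __init__(self, dictionary_path: str) -> None:
        # Only the option string is stored on the instance, not the tagger.
        self._options = f'-d "{dictionary_path}"'

    def tokenize(self, text: str) -> List[str]:
        # Every SharedTokenizer with the same options reuses the same tagger.
        return shared_tagger(self._options)(text)


tokenizers = [SharedTokenizer("/path/to/dic") for _ in range(1000)]
tokens = [t.tokenize("a b c") for t in tokenizers]
print(StubTagger.instances)  # 1
```

Creating a thousand wrappers here still constructs only one tagger, because the cache is keyed on the option string rather than on the wrapper instance.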
