Comments (12)
Sorry you're having trouble with this.
Does the listed matrix.bin file exist? Have you tried re-installing?
mmap-related failures usually have to do with running out of memory due to creating too many Tagger objects, but in this case it looks like the file may not exist. I have never heard of minato before, but since you mention it, maybe you're using it to cache the UniDic files or something?
from fugashi.
Yes, the file exists. I am using the python -m unidic download command, and I have tried re-installing it multiple times, but no luck! Yep, sorry, you can ignore minato.
from fugashi.
Here is the code
class Tokenizer:
    def __init__(
        self,
        system_dictionary_path: Optional[Union[str, PathLike]] = None,
        user_dictionary_path: Optional[Union[str, PathLike]] = None,
    ) -> None:
        if system_dictionary_path == "ipadic":
            system_dictionary_path = ipadic.DICDIR
        elif system_dictionary_path == "unidic":
            system_dictionary_path = unidic.DICDIR
        self._system_dictionary_path = system_dictionary_path or unidic.DICDIR
        self._user_dictionary_path = user_dictionary_path
        self._tagger: Optional[fugashi.Tagger] = None

    @classmethod
    def from_config(cls, config: SectionProxy) -> "Tokenizer":
        return Tokenizer(
            system_dictionary_path=config.get("system_dictionary_path"),
            user_dictionary_path=config.get("user_dictionary_path"),
        )

    @property
    def tagger(self) -> fugashi.Tagger:
        # setup tagger
        options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]
        if self._user_dictionary_path:
            options.append(f"-u {minato.cached_path(self._user_dictionary_path)}")
        if not self._tagger:
            self._tagger = fugashi.GenericTagger(" ".join(options))
        # setup token parser
        if "ipadic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_ipadic
        elif "unidic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_unidic
        else:
            raise ValueError("system_dictionary_path must contain 'ipadic' or 'unidic'")
        return self._tagger

    @staticmethod
    def normalize(text: str) -> str:
        text = jaconv.z2h(text, kana=False, ascii=True, digit=True)
        text = jaconv.h2z(text, kana=True, ascii=False, digit=False)
        text = text.replace("〜", "ー")
        return text

    def tokenize(self, text: str) -> List[Token]:
        return [self._parse_feature(token) for token in self.tagger(text)]

    def __getstate__(self) -> Dict[str, Any]:
        return {
            "system_dictionary_path": self._system_dictionary_path,
            "user_dictionary_path": self._user_dictionary_path,
        }

    def __setstate__(self, state: Dict[str, Any]) -> None:
        self._tagger = None
        self._system_dictionary_path = state["system_dictionary_path"]
        self._user_dictionary_path = state["user_dictionary_path"]
from fugashi.
What is an example of the actual code that causes the issue? You have provided a class definition but no code using it. Also, you said it was OK to ignore minato, but your example code uses minato to cache the dictionary path...
Does just using this code work?
import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
from fugashi.
Yes, it works.
Code:
import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple
Output:
麩 麩 名詞,普通名詞,一般,*
菓子 菓子 名詞,普通名詞,一般,*
は は 助詞,係助詞,*,*
、 、 補助記号,読点,*,*
麩 麩 名詞,普通名詞,一般,*
を を 助詞,格助詞,*,*
主材 主材 名詞,普通名詞,一般,*
料 料 接尾辞,名詞的,一般,*
と と 助詞,格助詞,*,*
し 為る 動詞,非自立可能,*,*
た た 助動詞,*,*,*
日本 日本 名詞,固有名詞,地名,国
の の 助詞,格助詞,*,*
菓子 菓子 名詞,普通名詞,一般,*
。 。 補助記号,句点,*,*
from fugashi.
OK, in that case it seems like something is wrong with your wrapper class, particularly this line:
options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]
from fugashi.
OK, let me try using unidic.DICDIR directly, without minato.
from fugashi.
No luck. The tokenizer actually works on most of the text, but after some time it gets stuck on this error when it tries to open the matrix.bin file. I'm not sure if it's a memory issue; I have 2.5 million strings to tokenize.
from fugashi.
The matrix.bin file is only accessed when the Tagger is first created, so it sounds like you're creating multiple taggers. Are you doing something like #35, where you're creating a Tagger inside a loop or something?
You typically don't need more than one Tagger in a whole process, or at most one per thread.
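As a rough illustration of that advice, here is a minimal sketch that builds the tagger once and reuses it for every string. The names StandInTagger, get_tagger, and tokenize_all are hypothetical, and the stand-in class exists only so the sketch runs without MeCab or a dictionary installed; in real code the factory would presumably return fugashi.Tagger() instead.

```python
from functools import lru_cache


class StandInTagger:
    """Hypothetical stand-in for fugashi.Tagger (assumption: not the real class)."""

    created = 0  # counts how many tagger objects have been constructed

    def __init__(self):
        StandInTagger.created += 1

    def __call__(self, text):
        # The real tagger runs MeCab; this stand-in just splits on whitespace.
        return text.split()


@lru_cache(maxsize=None)
def get_tagger():
    # Real code would do something like: return fugashi.Tagger()
    return StandInTagger()


def tokenize_all(texts):
    tagger = get_tagger()  # built on the first call, reused afterwards
    return [tagger(text) for text in texts]


results = tokenize_all(["a b", "c d e"] * 1000)
print(StandInTagger.created)  # 1: only one tagger was ever built
```

The point of the sketch is only that construction happens outside the per-string loop, so the dictionary files are opened once regardless of how many strings are processed.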
from fugashi.
Closing because this seems to be a usage issue and there's not enough information to debug it. If you can provide a reproducible example I will take a closer look.
from fugashi.
It is solved, thanks. I was creating multiple instances.
from fugashi.
Glad you figured it out. You need to be careful when creating multiple instances, as you can quickly run out of memory, which can cause mmap errors.
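If several threads really do need to tokenize at once, one way to stay within "one Tagger per thread" is to keep the instance in thread-local storage. This is only a sketch under assumptions: StandInTagger, get_thread_tagger, and tokenize are hypothetical names, and the stand-in class replaces fugashi.Tagger so the example runs without a dictionary installed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StandInTagger:
    """Hypothetical stand-in for fugashi.Tagger (assumption: not the real class)."""

    created = 0
    _lock = threading.Lock()

    def __init__(self):
        with StandInTagger._lock:  # keep the counter accurate across threads
            StandInTagger.created += 1

    def __call__(self, text):
        return text.split()


_local = threading.local()


def get_thread_tagger():
    # Each thread lazily builds its own tagger once, then reuses it.
    tagger = getattr(_local, "tagger", None)
    if tagger is None:
        tagger = StandInTagger()  # real code: fugashi.Tagger()
        _local.tagger = tagger
    return tagger


def tokenize(text):
    return get_thread_tagger()(text)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tokenize, ["a b"] * 1000))

# At most one tagger per worker thread, no matter how many strings ran.
print(StandInTagger.created <= 4)
```

With this layout the number of taggers is bounded by the number of worker threads rather than by the number of strings.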
from fugashi.