Comments (7)
I believe the issue is that the proposed changes are on the Fix-Remove-Diacritics branch of the repository. With your command, pip uses the master branch where the changes are not yet implemented.
You can try using
!pip install git+https://github.com/SummerOfCode-NoHate/texthero.git@Fix_Remove_Diacritics
I think that should work. If not, you could just copy the functions I pasted in #72 and try them out directly.
from texthero.
Hi, I just had a look and opened a PR to fix this at #72. I pasted the new functions there; it would be great if you could try them out and comment there on whether everything works for you. Maybe I missed something regarding the Urdu language.
from texthero.
Please correct me if I am doing something wrong; I have never used GitHub before.
Installed your version,
!pip install git+https://github.com/SummerOfCode-NoHate/texthero.git
and tested with the same code
import pandas as pd
import texthero as hero
s = pd.Series("Montréal, über, noël, 889, اِس, اُس")
s1 = hero.remove_diacritics(s)
s1
gives the following output
0 Montreal, uber, noel, 889, is, us
dtype: object
from texthero.
Hi Hashim,
thank you for your contribution!
@henrifroese and @cmhashim, probably the way we should design multilingual support for Texthero is to have:
from texthero.ur import hero
hero.remove_diacritics(...)
Where this remove_diacritics is specialized in dealing with Urdu text.
What's your opinion? That way we can keep the code in each function simple, as well as develop each function for that specific language.
from texthero.
I think that if functions for multilingual support are added (e.g. functions that specifically handle Arabic script), they should get separate modules; that would make sense. However, I think this issue/fix improves remove_diacritics more generally by preventing transliteration, so it can now handle everything from before plus Urdu, Arabic, and so on, which is why I wouldn't put it in a separate module.
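To illustrate what "preventing transliteration" could look like, here is a minimal sketch using only the standard library. It is not the code from #72; the function name strip_combining_marks is hypothetical, and the approach (decompose with NFD, then drop combining marks) is an assumption about one reasonable way to do it.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop combining marks but keep each base character in its own script.

    Hypothetical helper for illustration; not part of texthero.
    """
    # NFD splits precomposed characters such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for marks such as Latin accents
    # or the Arabic kasra/damma; base letters return 0 and are kept.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


sample = "Montréal, über, noël, 889, اِس, اُس"
# The Urdu words stay in Arabic script instead of becoming "is" / "us".
print(strip_combining_marks(sample))
```

On a pandas Series, a function like this could be applied with s.apply(strip_combining_marks).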
from texthero.
That makes complete sense! 👍
from texthero.
from texthero.ur import hero
hero.remove_diacritics(...)
Where this remove_diacritics is specialized in dealing with Urdu text.
I think it's time to do this. It was my mistake: I gave a very simple example of Urdu text with diacritics, but handling diacritics in Urdu/Arabic is much more complex. Some diacritics are part of Urdu words and must be written, while others can be omitted. Hence, could we have an optional argument that takes a list of diacritics to retain or remove?
Some Examples:
retain_diacritics_eg_text = "فوراً, حتیٰ, آزاد, ہوئی"
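A rough sketch of what such an optional argument might look like is below. Everything here is an assumption for discussion: the function name remove_diacritics_keeping and the keep parameter are hypothetical and not part of texthero, and the set of marks to retain for Urdu is only a guess based on the example words above.

```python
import unicodedata

# Marks assumed (for illustration only) to be essential in the example words:
# fathatan, superscript alef, madda above, hamza above.
URDU_KEEP = frozenset("\u064B\u0670\u0653\u0654")


def remove_diacritics_keeping(text: str, keep=frozenset()) -> str:
    """Remove combining marks except those listed in `keep` (hypothetical API)."""
    kept = []
    # NFD exposes marks hidden in precomposed letters, e.g. U+0622 -> U+0627 + U+0653.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) != 0 and ch not in keep:
            continue  # drop this diacritic
        kept.append(ch)
    # NFC recomposes retained marks with their base letters where possible.
    return unicodedata.normalize("NFC", "".join(kept))


text = "فوراً, حتیٰ, آزاد, ہوئی"
print(remove_diacritics_keeping(text))                   # strips every mark
print(remove_diacritics_keeping(text, keep=URDU_KEEP))   # retains the listed marks
```

An explicit keep set leaves the default behaviour unchanged while letting a caller decide per language which marks are part of the spelling.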
from texthero.