Comments (19)

LandGrey avatar LandGrey commented on May 29, 2024

Can pydictor find and remove non-UTF-8 characters?

pydictor does not currently include this feature. You can use another tool for it, such as iconv.
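
For readers who want to check a wordlist before running other tools on it, here is a minimal Python sketch of "find and remove" for invalid UTF-8. The function names find_invalid_utf8 and strip_invalid_utf8 are hypothetical and are not part of pydictor or iconv. The finder reports only the first bad byte on each line.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset_in_line) for each line that is not valid UTF-8."""
    problems = []
    for line_no, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte in this line
            problems.append((line_no, exc.start))
    return problems


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop undecodable bytes, roughly what `iconv -f utf-8 -t utf-8 -c` does."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


sample = b"alpha\nbra\xffvo\ncharlie\n"
bad_lines = find_invalid_utf8(sample)
cleaned = strip_invalid_utf8(sample)
```

Both helpers take the whole input as one bytes object, so they only suit files that fit in memory; a very large wordlist would need to be processed line by line instead.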

from pydictor.

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

Thanks for the answer.

I applied the iconv tool with this command: iconv -f utf-8 -t utf-8 -c test.txt -o clean_test.txt.
Then I ran pydictor on clean_test.txt with: pydictor --len 6 20 -tool handler clean_test.txt -o super_clean_test.txt

But I got the following error:

  File "E:\pydictor-2.0.5\pydictor.py", line 107, in <module>
    tool_parser()
  File "E:\pydictor-2.0.5\lib\parse\argsparse.py", line 104, in tool_parser
    get_handler_dic(pyoptions.args_tool[1])
  File "E:\pydictor-2.0.5\tools\handler.py", line 24, in get_handler_dic
    for item in f.readlines():
  File "C:\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3858: character maps to <undefined>
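
The last frame shows the file being decoded with cp1252, which is what Python's open() typically falls back to on Windows when no encoding is passed. Byte 0x8d is a legal continuation byte in UTF-8 but has no mapping in cp1252, so even a file that iconv has cleaned can fail this way. A minimal sketch of a more tolerant read, assuming the wordlist is meant to be UTF-8 (read_lines_lenient is a hypothetical helper, not pydictor's code):

```python
import os
import tempfile


def read_lines_lenient(path, encoding="utf-8"):
    """Yield lines from a text file, silently dropping bytes that cannot be decoded."""
    # Passing encoding explicitly avoids the platform default (cp1252 on many
    # Windows setups); errors="ignore" skips undecodable bytes instead of raising.
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        for line in f:  # iterate lazily instead of calling readlines()
            yield line.rstrip("\r\n")


# Small demonstration on a temporary file containing a stray 0x8d byte.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write(b"pass\x8dword\nhello\n")
    demo_lines = list(read_lines_lenient(sample_path))
```

Iterating over the file object rather than calling readlines() also keeps only one line in memory at a time, which matters for very large wordlists.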

LandGrey avatar LandGrey commented on May 29, 2024

I just added a printable-character filter tool to pydictor.
You can download the latest pydictor version (2.1.5.6) and use the command python pydictor.py --len 6 20 -tool printabler test.txt to get your wordlist.
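
As an illustration of what a printable-character filter with length bounds might do, here is a rough sketch. It is not the actual printabler implementation; the helper name filter_printable, the inclusive 6 to 20 bounds, and the choice to treat only non-whitespace ASCII as printable are all assumptions.

```python
import string

# Assumption: "printable" means ASCII letters, digits and punctuation (no whitespace).
PRINTABLE = set(string.printable) - set(string.whitespace)


def filter_printable(lines, min_len=6, max_len=20):
    """Yield words that are entirely printable and within the length bounds."""
    for line in lines:
        word = line.strip()
        if min_len <= len(word) <= max_len and all(ch in PRINTABLE for ch in word):
            yield word


candidates = ["abc", "password1", "caf\u00e9latte", "x" * 21, "pass word1"]
kept = list(filter_printable(candidates))
```

Because the function is a generator, it can be chained onto a lazy file reader without loading the whole wordlist.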

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

Thank you for your help.
Unfortunately, the issue is still occurring. You can test it with this file I found online: https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html

LandGrey avatar LandGrey commented on May 29, 2024

I fixed the bug.
Please download pydictor version 2.1.5.7 and use the command python pydictor.py --len 6 20 -tool printabler test.txt.

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

Thank you for your effort.
Unfortunately, I got the same error.

LandGrey avatar LandGrey commented on May 29, 2024

Try version 2.1.5.8.

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

You did it! Amazing!!
Thank you so so much!! 💯

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

Update: pydictor now works well with non-UTF-8 characters.
The issue I have now is that I hit a memory limit.

LandGrey avatar LandGrey commented on May 29, 2024

Try the latest version, 2.1.6.0.

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

I tried version 2.1.6.0 and got different behavior, but the same end result.
The memory behavior changed. Instead of shooting straight up, memory usage rose slowly and steadily for 5 minutes, and then the memory error eventually appeared.

[Screenshots omitted: memory usage during the process and at the end of the process.]

LandGrey avatar LandGrey commented on May 29, 2024

That's caused by the step that removes duplicate lines in memory while preserving their order.
Your input file must be huge. I will consider a better way to fix it.
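
To show why this step is memory-hungry, here is a generic sketch of order-preserving de-duplication. It is the textbook approach and is not taken from pydictor's source; the name unique_preserving_order is hypothetical.

```python
def unique_preserving_order(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()  # holds every distinct line seen so far
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


deduped = list(unique_preserving_order(["a", "b", "a", "c", "b"]))
```

The seen set grows with the number of distinct lines, so for a wordlist of hundreds of gigabytes it cannot fit in RAM no matter how the input is read.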

LandGrey avatar LandGrey commented on May 29, 2024

Maybe you can try version 2.1.7.0.

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

Unfortunately, the same behavior. For info, the file size of this dictionary is 400 GB.

LandGrey avatar LandGrey commented on May 29, 2024

Download the latest version, 2.1.7.1. I reduced the pyoptions.memory_unique_max_lines_count variable in pydictor's lib/data/data.py file to 10000000, so just try again.
If you see the same behavior, you can reduce the variable further and try again.

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

OK. I'm experimenting with the variables. Is there a way for pydictor to show its progress as a percentage or as lines processed? Any indication of progress would be very helpful for troubleshooting and for benchmarking the performance effect of different variables :)
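
One way such progress reporting could work is sketched below: wrap the file iteration, count bytes read, and compare against the file size. This is an illustration only; iter_with_progress and its parameters are hypothetical and do not correspond to any pydictor option.

```python
import io
import os
import sys
import tempfile


def iter_with_progress(path, report_every=1_000_000, out=sys.stderr):
    """Yield raw lines from path, printing line count and percent of bytes read."""
    total = os.path.getsize(path)
    done = 0
    with open(path, "rb") as f:
        for count, raw in enumerate(f, start=1):
            done += len(raw)
            if count % report_every == 0:
                pct = 100.0 * done / total if total else 100.0
                print(f"{count} lines, {pct:.1f}%", file=out)
            yield raw


# Small demonstration: four lines, reporting every two lines into a buffer.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "words.txt")
    with open(demo_path, "wb") as f:
        f.write(b"a\nb\nc\nd\n")
    log = io.StringIO()
    demo_lines = list(iter_with_progress(demo_path, report_every=2, out=log))
```

Reading in binary mode makes the byte count exact and sidesteps decoding errors; the percentage reflects bytes read, not the work done by later processing stages.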

Privacy6484847 avatar Privacy6484847 commented on May 29, 2024

So, after a lot of testing: reducing the pyoptions.memory_unique_max_lines_count variable lowers performance dramatically and only delays the error.
I even tried a relatively small 30 GB file and gave Windows a 60 GB paging file. pydictor still ended with a memory error.
Maybe instead of relying on the paging file, pydictor should incrementally save the processed data directly to the hard drive. :)
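
A rough sketch of what disk-backed de-duplication could look like: split the input into bucket files by hash so that identical lines always land in the same bucket, then de-duplicate each bucket separately. This is one generic technique and not pydictor's method; dedupe_on_disk, the bucket count, and the use of CRC32 are assumptions. Unlike the in-memory approach, it does not preserve the original global line order.

```python
import os
import tempfile
import zlib


def dedupe_on_disk(src_path, dst_path, buckets=16, work_dir=None):
    """Write the distinct non-empty lines of src_path to dst_path using bucket files."""
    with tempfile.TemporaryDirectory(dir=work_dir) as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.bin") for i in range(buckets)]
        files = [open(p, "wb") for p in paths]
        try:
            # Pass 1: route each line to a bucket chosen by its hash.
            with open(src_path, "rb") as src:
                for raw in src:
                    line = raw.rstrip(b"\r\n")
                    if line:
                        files[zlib.crc32(line) % buckets].write(line + b"\n")
        finally:
            for f in files:
                f.close()
        # Pass 2: only one bucket's distinct lines are held in memory at a time.
        with open(dst_path, "wb") as dst:
            for p in paths:
                seen = set()
                with open(p, "rb") as bucket:
                    for raw in bucket:
                        if raw not in seen:
                            seen.add(raw)
                            dst.write(raw)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as demo_dir:
    src_file = os.path.join(demo_dir, "in.txt")
    dst_file = os.path.join(demo_dir, "out.txt")
    with open(src_file, "wb") as f:
        f.write(b"alpha\nbravo\nalpha\ncharlie\nbravo\n")
    dedupe_on_disk(src_file, dst_file, buckets=4)
    with open(dst_file, "rb") as f:
        result = f.read().splitlines()
```

Peak memory is roughly the size of the largest bucket's distinct lines, so the bucket count would need to scale with the input; the temporary files need about as much free disk space as the input itself.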

tl123987 avatar tl123987 commented on May 29, 2024

Could you add a feature to limit the number of lines in a generated dictionary?
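
As a sketch of how such a limit could be applied to any lazily generated wordlist, itertools.islice can cap the output without building the full list first. The generator generate_words and the helper limit_lines are hypothetical stand-ins, not pydictor functions.

```python
from itertools import islice, product


def generate_words(charset, length):
    """Hypothetical stand-in for a wordlist generator: all strings of a fixed length."""
    for combo in product(charset, repeat=length):
        yield "".join(combo)


def limit_lines(words, max_lines):
    """Stop after max_lines items; the rest of the generator is never evaluated."""
    return islice(words, max_lines)


first_five = list(limit_lines(generate_words("abc", 2), 5))
```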

tl123987 avatar tl123987 commented on May 29, 2024

Author, why do I get an error when I write my own rules?
