pyth's People

Contributors

ayal, brendonh, eriol, guillaumechereau, meirkriheli, mihaip, sets88, somakrdas, watercrossing, yairchu

pyth's Issues

Hex (encoded images) in `\pict` control groups is not removed.

When I read an RTF and write it out as plain text (both with pyth), all of the hex for embedded images is included in the document. As expected, the \pict control group itself is gone.

At the moment, I'm preprocessing these files to wipe out the pict group (hex included) before using pyth, but, of course, it would be nice to avoid that. I'm not familiar enough with RTF versions to know if this is part of the 1.5 spec or a later one. However, these files run perfectly otherwise.

I can send you an example, if needed.
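
The preprocessing step described above could look roughly like the following. This is only a sketch, not part of pyth: `strip_pict_groups` and `_skip_group` are hypothetical helper names, and the code assumes the image data always sits inside a brace-balanced `{\pict ...}` group.

```python
def _skip_group(text, start):
    """Return the index just past the group that opens at text[start]."""
    depth = 0
    i, n = start, len(text)
    while i < n:
        ch = text[i]
        if ch == "\\" and i + 1 < n:
            i += 2  # control symbol such as \{ or \}: not a group delimiter
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return n  # unbalanced input: drop everything up to the end


def strip_pict_groups(raw_rtf):
    """Remove every {\\pict ...} group, hex data included (hypothetical helper)."""
    out = []
    i, n = 0, len(raw_rtf)
    while i < n:
        if raw_rtf.startswith("{\\pict", i):
            i = _skip_group(raw_rtf, i)
        elif raw_rtf[i] == "\\" and i + 1 < n:
            out.append(raw_rtf[i:i + 2])  # copy escapes as a unit
            i += 2
        else:
            out.append(raw_rtf[i])
            i += 1
    return "".join(out)
```

The cleaned string can then be handed to the reader in place of the original file contents. Binary `\bin` picture data is not handled by this sketch.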

Import field text from RTFs

The code currently passes over all non-photo fields, including textboxes with text in them, drop-down menus with items selected, and checkboxes that are checked.

I tried to fit this into the code's current structure by adding something to the handle_field method, but I couldn't figure it out. So instead I wrote a pre-processing function that finds those types of fields and replaces each one with the text that was supposed to be there: the entered text for a textbox, the selected text for a drop-down list (or the default, if that's appropriate), and "Yes" or "No" for a checkbox. When you then run the result through the converter, the field contents come out as plain text.

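This requires the third-party regex module rather than the standard-library re, because the patterns use recursive subpatterns such as (?1).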

I'm not submitting this as a pull request because I'm not sure where you'd want to include this sort of pre-processing. But if you want to do so, here is the function:



import regex

def flattenrtffields(rawrtf):

    #get all "fields" including nested
    fieldsearch=regex.compile(r"{\\field[^{]*?({(?>[^{}]+|(?1))*})({(?>[^{}]+|(?1))*})}")
    m = fieldsearch.finditer(rawrtf)
    if m:

      textboxes,drops,checks=[],[],[]
      checkboxoptions=["No","Yes"]

      #Make lists of the kinds of fields to flatten
      for field in m:
        if "FORMTEXT" in field[0]:
          textboxes.append(field[0])
        elif "FORMDROPDOWN" in field[0]:
          drops.append(field[0])
        elif "FORMCHECKBOX" in field[0]:
          checks.append(field[0])
        else:
          pass

      #deal with textboxes
      for textbox in textboxes:
        try:
          result = regex.search(r"fldrslt ({(?>[^{}]+|(?1))*})}",textbox)[1]
          if result:
            rawrtf=rawrtf.replace(textbox,result)
        except:
          pass

      #deal with dropdownlists
      for drop in drops:
        try:
          ddresult = regex.search(r"fftype2.*ffres([0-9]*)",drop)[1]
          if ddresult=="25":
            ddresult=regex.search(r"ffdefres([0-9]*)",drop)[1]
          ddlist = regex.findall(r"ffl ([^}]*)}",drop)
          rawrtf=rawrtf.replace(drop,"{\\rtlch "+ddlist[int(ddresult)]+"}")
        except:
          pass

      #deal with checkboxes
      for check in checks:
        try:
          result = regex.search(r"fftype1.*ffres([0-9]*)",check)[1]
          if result=="25":
            result=regex.search(r"ffdefres([0-9]*)",check)[1]
          rawrtf=rawrtf.replace(check,"{\\rtlch "+checkboxoptions[int(result)]+"}")
        except:
          pass

    return rawrtf

newline for plaintext writer

It seems that the newline parameter is not used. I can make a pull request, but this is my first time using this tool, so please confirm that I should just replace "\n" with self.newline.

There also seems to be a bug in the RTF-to-text conversion involving the newline that ends a paragraph and the newline produced by an explicit line break, but I must investigate to see exactly why.
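
To make the proposed change concrete, here is a minimal sketch of a writer that honours its newline parameter. `MiniPlaintextWriter` is a hypothetical stand-in written for this example, not pyth's actual PlaintextWriter, and it takes a plain list of paragraph strings instead of a pyth document.

```python
from io import StringIO


class MiniPlaintextWriter:
    """Hypothetical stand-in for a plain-text writer; not pyth's class."""

    def __init__(self, paragraphs, newline="\n"):
        self.paragraphs = paragraphs
        self.newline = newline

    def write(self):
        target = StringIO()
        for paragraph in self.paragraphs:
            # Use the configured separator both for line breaks inside a
            # paragraph and for the break that ends the paragraph.
            target.write(paragraph.replace("\n", self.newline))
            target.write(self.newline)
        return target
```

With `newline="\r\n"`, every line break in the output becomes `\r\n` instead of a hard-coded `"\n"`.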

Exclude images when writing plain text files

Converting an RTF with images to a plain text file writes the binary data of the image to the plain text file. When I create a plain text file, I expect that the images will be stripped.

pyth.__version__ claims to be 0.5.6 even though packaged as 0.6.0

setup.py shows the current version is 0.6.0 but the __version__ string in __init__.py is still set to 0.5.6. I noticed this when I did a pip install today and tried to track down why I was getting the version from 2010 rather than 2014. (Turns out I wasn't; the version string just needs to be updated.)

If you could please fix this discrepancy, I would appreciate it. Thanks!
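
One way to keep the two numbers from drifting apart again is to have setup.py read the version out of `__init__.py`. The following is a sketch only: `read_version` is a hypothetical helper, and the `pyth/__init__.py` path in the usage note is an assumption about the repository layout.

```python
import re


def read_version(init_source):
    """Extract the __version__ string from the text of an __init__.py file."""
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", init_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)
```

In setup.py this could be used as `version=read_version(open("pyth/__init__.py").read())`, so the version is defined in exactly one place.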

Unicode error when reading RTF

When trying to read https://www.gnu.org/licenses/lgpl.rtf I get:

>>> b=Rtf15Reader.read(open('lgpl.rtf', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 86, in read
    return reader.go()
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 109, in go
    self.parse()
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 143, in parse
    self.group.handle(control, digits)
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 402, in handle
    handler(digits)
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 521, in handle_ansi_escape
    char = chr(code).decode(self.charset, self.reader.errors)
UnicodeDecodeError: 'cp932' codec can't decode byte 0x81 in position 0: incomplete multibyte sequence
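
The traceback shows each escaped byte being decoded on its own, which fails for the lead byte of a multibyte character. A possible direction, sketched here in Python 3 with a hypothetical helper name, is to feed the bytes to an incremental decoder so that a lead byte is held back until its trail byte arrives. This is an illustration of the idea, not a patch against reader.py.

```python
import codecs


def decode_escaped_bytes(byte_values, charset, errors="replace"):
    """Decode a sequence of byte values one at a time without splitting
    multibyte characters (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(charset)(errors)
    pieces = [decoder.decode(bytes([value])) for value in byte_values]
    # Flush: reports or replaces a dangling lead byte at the end of the input.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)
```

For example, the cp932 byte pair 0x81 0x40 decodes to U+3000 (ideographic space), whereas decoding 0x81 alone raises the error shown above.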

RTF control word "\f0" not recognized

I'm using RTF files generated by pandoc. They have a lot of "\f0" control words (no idea why).

/plugins/rtf15/reader.py cannot read these files because of this "\f0" word.

For a general solution, could you skip unknown control words?

Example rtf:
{\rtf\ansi\deff0{\fonttbl{\f0\froman Tms Rmn;}{\f1\fdecor
Symbol;}{\f2\fswiss Helv;}}{\colortbl;\red0\green0\blue0;
\red0\green0\blue255;\red0\green255\blue255;\red0\green255
blue0;\red255\green0\blue255;\red255\green0\blue0;\red255
green255\blue0;\red255\green255\blue255;}{\stylesheet{\fs20
\snext0Normal;}}{\info{\author John Doe}
{\creatim\yr1990\mo7\dy30\hr10\min48}{\version1}{\edmins0}
{\nofpages1}{\nofwords0}{\nofchars0}{\vern8351}}\widoctrl\ftnbj \sectd\linex0\endnhere \pard\plain \fs20 This is plain text.\
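
Skipping unknown control words could be done at the point where the handler is looked up. The sketch below assumes a `handle_<control>` naming scheme, which is suggested by the `handle_ansi_escape` and `handle_u` names in tracebacks from other issues. `TolerantGroup` and its `handle_par` method are hypothetical and only illustrate the dispatch.

```python
class TolerantGroup:
    """Hypothetical stand-in for the reader's group object."""

    def __init__(self):
        self.content = []
        self.skipped = []

    def handle(self, control, digits):
        # Look up handle_<control>; ignore the control word if none exists.
        handler = getattr(self, "handle_" + control, None)
        if handler is None:
            self.skipped.append(control)  # kept only so callers can log it
            return
        handler(digits)

    def handle_par(self, digits):
        self.content.append("\n")
```

With this shape, `handle("f", "0")` is recorded and ignored instead of raising.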

Implement lists in RTF reader

It would be nice if the RTF reader were able to parse \listtable and read \levelnfc (type of list) and possibly \levelstartat (list start value) as documented here. It should also implement list overrides, but just to get the proper reference for \ls elements.

I have a lot of tiny RTF documents, some including bulleted/numbered lists, and I would love to be able to preserve this information and output it as XHTML.
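
As a starting point, the two values could be pulled out of a single \listlevel group like this. This is a sketch with a hypothetical function name. The number-format codes in the table are my reading of the RTF specification and should be checked against it, and anything not listed is reported as "unknown".

```python
import re

# Assumed subset of \levelnfc codes; verify against the RTF specification.
LEVELNFC_NAMES = {
    0: "decimal",
    1: "upper-roman",
    2: "lower-roman",
    3: "upper-alpha",
    4: "lower-alpha",
    23: "bullet",
}


def parse_list_level(level_rtf):
    """Return the list type and start value found in one \\listlevel group."""
    nfc = re.search(r"\\levelnfc(\d+)", level_rtf)
    start = re.search(r"\\levelstartat(\d+)", level_rtf)
    list_type = "unknown"
    if nfc:
        list_type = LEVELNFC_NAMES.get(int(nfc.group(1)), "unknown")
    return {"type": list_type, "start": int(start.group(1)) if start else 1}
```

Resolving \ls references through the list override table would still be needed on top of this.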

Please upload a new release to PyPI

Hi

The latest pyth on PyPI is from Aug. 2010. Since then, commit c444a8d has fixed an issue that has come back to bite me repeatedly. I would appreciate it if you could upload a more recent version of pyth, so that when I install from PyPI, it comes without this issue.

Thanks!

rtf reader: unichr() causes ValueError ()

I have an RTF file with strange Unicode strings (I'm sending it to you by email).

This causes rtf reader to throw ValueError:

* Module pyth.plugins.rtf15.reader, line 93, in read
* Module pyth.plugins.rtf15.reader, line 113, in go
* Module pyth.plugins.rtf15.reader, line 141, in parse
* Module pyth.plugins.rtf15.reader, line 369, in handle
* Module pyth.plugins.rtf15.reader, line 476, in handle_u
ValueError: unichr() arg not in range(0x10000) (narrow Python build) 

The reason is that my Python was built without support for "wide" Unicode characters (http://www.python.org/dev/peps/pep-0261/). However, exception handling would be nice.
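
The requested exception handling could be as small as the guard below. It is written for Python 3, where `chr` replaces `unichr`, and the same try/except shape would apply to `unichr` on a narrow Python 2 build. `rtf_unicode_char` is a hypothetical name. The negative-value adjustment reflects the fact that RTF writes the `\u` parameter as a signed 16-bit number.

```python
def rtf_unicode_char(value, fallback="?"):
    """Convert an RTF \\u parameter to a character, or return fallback."""
    if value < 0:
        value += 65536  # \u parameters above 32767 are written as negatives
    try:
        return chr(value)
    except (ValueError, OverflowError):
        return fallback  # code point outside the range this build supports
```

A caller would use the returned character as-is and could log when the fallback was taken.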

rtf reader: non-ASCII metadata causes UnicodeDecodeError (OpenOffice RTF files)

I have OpenOffice RTF files with non-ASCII metadata (author):

{\info{\author Claudia Jürgens}{\creatim\yr2010\mo7\dy19\hr12\min45}{\author Claudia Jürgens}
{\revtim\yr2010\mo7\dy28\hr13\min27}{\printim\yr0\mo0\dy0\hr0\min0}{\comment    
StarWriter}{\vern3000}}\deftab709   

This causes UnicodeDecodeError:

Module pyth.plugins.rtf15.reader, line 93, in read
Module pyth.plugins.rtf15.reader, line 113, in go                                           
Module pyth.plugins.rtf15.reader, line 147, in parse
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128) 

This patch just catches the error:

*** reader.py  2010-05-04 21:48:14.000000000 +0200 
--- reader.py   2010-08-04 21:47:10.000000000 +0200 
***************       
*** 140,146 ****      
                  control, digits = self.getControl() 
                  self.group.handle(control, digits) 
              else:   
!                 self.group.char(unicode(next)) 


      def getControl(self): 
--- 140,149 ----      
                  control, digits = self.getControl() 
                  self.group.handle(control, digits) 
              else:   
!                 try: 
!                     self.group.char(unicode(next)) 
!                 except UnicodeDecodeError, e: 
!                     self.group.char('?') 


      def getControl(self):

CJK characters support for RTF parse

Hello guys,

CJK means Chinese, Japanese, and Korean. Many old RTF writers don't store these characters as Unicode, and using pyth to read CJK characters from such documents causes a "UnicodeDecodeError", because the CJK codecs use 4 hex digits (two bytes) per character, not 2.

I modified plugins/rtf15/reader.py to meet my own needs, but I still hope someone can write better code to deal with this issue.

1)Add this first:

from binascii import unhexlify

2)Add number 936:

# All the ones named by number in my 2.6 encodings dir
_CODEPAGES_BY_NUMBER = dict(
    (x, "cp%s" % x) for x in (37, 1006, 1026, 1140, 1250, 1251, 1252, 1253, 1254, 1255,
                              1256, 1257, 1258, 424, 437, 500, 737, 775, 850, 852, 855,
                              856, 857, 860, 861, 862, 863, 864, 865, 866, 869, 874,
                              875, 932, 936, 949, 950))

3)Change to 'ignore' :

def read(self, source, errors='ignore'):

4):

            if next == "'":
                # ANSI escape, takes two hex digits
                chars.extend("ansi_escape")
                digits.extend(self.source.read(2))

                # For some Asian languages, take two more digits
                # (Japanese, Simplified Chinese, Korean, Traditional Chinese)
                if self.charset in ("cp932", "cp936", "cp949", "cp950"):
                    if self.source.read(2) == "\\'":
                        digits.extend(self.source.read(2))

                break
    def handle_ansi_escape(self, code):
        cjk = code
        code = int(code, 16)

        if isinstance(self.charset, dict):
            uni_code = self.charset.get(code)
            if uni_code is None:
                char = u'?'
            else:
                char = unichr(uni_code)

        else:
            if code <= 255:
               char = chr(code).decode(self.charset, self.reader.errors)
               self.content.append(char)
            else:
               char = unhexlify(cjk).decode(self.charset, self.reader.errors)
               self.content.append(char)
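
A different way to get the same effect, independent of which code page is in use, is to collect each run of consecutive \'xx escapes and decode the whole run at once. The following Python 3 sketch works on a string of raw RTF. `decode_escape_runs` is a hypothetical helper, not part of pyth, and it does not account for an escaped backslash directly in front of an apostrophe.

```python
import re

# One or more consecutive \'xx escapes.
_ESCAPE_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_escape_runs(rtf_text, charset, errors="replace"):
    """Replace each run of \\'xx escapes with its decoded text."""

    def replace_run(match):
        hex_digits = match.group(0).replace("\\'", "")
        return bytes.fromhex(hex_digits).decode(charset, errors)

    return _ESCAPE_RUN.sub(replace_run, rtf_text)
```

Because the run is decoded as a unit, two-byte characters stay together without listing the CJK code pages explicitly.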

rtf reader: decode argument TypeError

I'm sending you a test file.

Module pyth.plugins.rtf15.reader, line 103, in read
Module pyth.plugins.rtf15.reader, line 124, in go
Module pyth.plugins.rtf15.reader, line 155, in parse 
Module pyth.plugins.rtf15.reader, line 385, in char
TypeError: decode() argument 1 must be string, not dict

Python 3 support

It would be nice if pyth supported Python 3. Is that support on the roadmap?

Request - Support for parsing string instead of file

I'm looking for a way to parse string content instead of reading from a file, something like an overload of the Rtf15Reader.read() function that accepts a string as input.

My code is reading the content from a database table and one of the columns contains RTF content which needs to be parsed as plain text.

For example: Column content
{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}{\colortbl ;\red0\green0\blue0;}{\*\generator Riched20 15.0.4811}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\f0\fs20\lang2057 hi\f1\lang1033\par{\*\lyncflags<rtf=1>}}

Should be parsed as:

hi

Is there a way to achieve it using this utility?
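
A possible workaround is to wrap the string in an in-memory file object. This assumes the reader only calls `.read()` on whatever it is given, which the `self.source.read(2)` calls quoted in another issue suggest but which I have not verified. `rtf_string_to_file` is a hypothetical helper, shown for Python 3. On Python 2 the equivalent would be a StringIO object.

```python
import io


def rtf_string_to_file(rtf_text, encoding="ascii"):
    """Wrap RTF text in a binary file-like object (hypothetical helper)."""
    # RTF is normally 7-bit ASCII; pass another encoding if the column
    # holds raw 8-bit characters.
    return io.BytesIO(rtf_text.encode(encoding))
```

The result could then be passed where the examples pass an open file, for instance `Rtf15Reader.read(rtf_string_to_file(column_value))`.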

problem with reading file.rtf with table inside

Hello!
Could you please help.

I have a file.rtf with a table inside, and I need to read the text from the table.
But the program does not insert any separator when moving from one cell to the next within a row.

Example:
Input: Cell 1 : How / Cell 2 : are / Cell 3 : you
Output: Howareyou

My code is:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter

doc = Rtf15Reader.read(open('test.rtf', 'rb'))
PlaintextWriter.write(doc).getvalue()

Thank you!
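
Until the reader handles tables, one workaround is to insert a separator in front of every \cell control word before parsing, since \cell marks the end of a table cell in RTF. This is a sketch with a hypothetical helper name. The separator must be plain text with no braces or backslashes, and a literal backslash followed by the word "cell" in the document text would be matched by mistake.

```python
import re

# \cell ends a table cell; the lookahead avoids matching \cellx and similar.
_CELL = re.compile(r"\\cell(?![a-zA-Z])")


def insert_cell_separators(raw_rtf, separator=" "):
    """Put separator text before each \\cell so cell contents stay apart."""
    return _CELL.sub(lambda match: separator + match.group(0), raw_rtf)
```

Applied to the raw file contents before they reach the reader, this should turn "Howareyou" into "How are you " in the plain-text output, assuming the reader passes the inserted text through unchanged.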

Bump to 0.6.1?

Any chance you can cut a release and bump the PyPI version? Thanks!

rtf reading example broken?

Hi, I've just cloned repo and tried to run rtf reading example. Error:

Traceback (most recent call last):
  File "D:\dev\soft\pyth\rtf15.py", line 11, in <module>
    doc = Rtf15Reader.read(open(filename))
  File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 103, in read
    return reader.go()
  File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 124, in go
    self.parse()
  File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 143, in parse
    self.group = self.stack[-1]
IndexError: list index out of range
