brendonh / pyth

Python text markup and conversion

License: MIT License

Python 100.00%

pyth's Introduction

========================================
pyth - Python text markup and conversion
========================================

Pyth is intended to make it easy to convert marked-up text between different common formats.

*Marked-up text* means text which has:

* Paragraphs
* Headings
* Bold, italic, and underlined text
* Hyperlinks
* Bullet lists
* Simple tables
* Very little else


Formats I initially want to support are:

* xhtml
* rtf
* pdf (output)


These three formats cover web, Word / OpenOffice, and print. 


Design principles
=================

* Ignore unsupported information in input formats (e.g. page layout)
* Ignore font issues -- output in a single font.
* Ignore specific text sizes -- support relative sizes (bigger, smaller) only. Output in a single base size.
* Have no dependencies unless they are written in Python, and work.
* Make it easy to add support for new formats, by using an architecture based on *plugins* and *adapters*.



Examples
========

See http://github.com/brendonh/pyth/tree/master/examples/


Unit tests
==========

The sources contain some unit tests (written using the Python unittest
module) in the 'tests' directory.

To run the tests, either run them individually as Python scripts
or use `python nose`_.  If using nose, go into the tests directory and
invoke nosetests from there (make sure that the pyth module is on
PYTHONPATH).

.. _python nose: http://code.google.com/p/python-nose/

pyth's Issues

Please upload a new release to PyPi

The latest pyth on PyPI is from Aug. 2010. Since then, commit c444a8d has fixed an issue that has come back to bite me repeatedly. I would appreciate it if you could upload a more recent version of pyth, so that when I install from PyPI, it comes without this issue.

Thanks!

rtf to xhtml generates wrong non-ASCII characters

I found something new ;) and will send you a test file.

The XHTML output should be:

Mit fünf Jahren

Instead it shows:

Mit fŸnf Jahren
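
One plausible cause (an assumption, since the test file is not shown here) is that the RTF bytes are Mac Roman but are decoded as Windows-1252. The byte 0x9F is "ü" in Mac Roman and "Ÿ" in cp1252, which matches the symptom exactly. A minimal Python 3 sketch of the mismatch:

```python
# Assumption: the source file stores "ü" as the single byte 0x9F (Mac Roman).
raw = b"Mit f\x9fnf Jahren"

as_mac_roman = raw.decode("mac_roman")  # the reading the author intended
as_cp1252 = raw.decode("cp1252")        # the reading a Windows-1252 decoder produces

print(as_mac_roman)  # Mit fünf Jahren
print(as_cp1252)     # Mit fŸnf Jahren
```

If this is the cause, the reader would need to honor the character set declared in the RTF header rather than defaulting to cp1252.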

PlaintextWriter adds extraneous newline after each paragraph

See

pyth/pyth/plugins/plaintext/writer.py

Line 36 in f2a06fc

self.target.write("\n")

I think that these newlines should not be added (or the newline argument should be honored when using PlaintextWriter.write).
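
The requested behaviour can be sketched as follows. This is not pyth's writer: `SketchPlaintextWriter` and its constructor arguments are hypothetical, and the sketch only illustrates using the configured newline as a separator between paragraphs instead of writing a hard-coded "\n" after each one.

```python
from io import StringIO


class SketchPlaintextWriter:
    """Hypothetical stand-in for a plain-text writer (not pyth's class)."""

    def __init__(self, paragraphs, target=None, newline="\n"):
        self.paragraphs = paragraphs
        self.target = target if target is not None else StringIO()
        self.newline = newline

    def go(self):
        # Join paragraphs with the configured newline; nothing is appended
        # after the last paragraph.
        self.target.write(self.newline.join(self.paragraphs))
        return self.target


output = SketchPlaintextWriter(["first", "second"], newline="\r\n").go().getvalue()
```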

Add more tests, for instance from unrtf

unrtf has several interesting RTF test files, some of which fail to open with pyth:
http://ftp.gnu.org/gnu/unrtf/?C=M;O=D
They could be a good addition to the test suite.

pyth.version claims to be 0.5.6 even though packaged as 0.6.0

setup.py shows the current version is 0.6.0 but the __version__ string in __init__.py is still set to 0.5.6. I noticed this when I did a pip install today and tried to track down why I was getting the version from 2010 rather than 2014. (Turns out I wasn't; the version string just needs to be updated.)

If you could please fix this discrepancy, I would appreciate it. Thanks!
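
One common way to avoid this kind of drift is to keep the version in a single place and have setup.py read it from there. The sketch below is an assumption about how that could look, not the project's actual setup.py; `read_version` is a hypothetical helper.

```python
import re


def read_version(init_source):
    """Extract the __version__ string from the text of an __init__.py file."""
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", init_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# Stand-in for the contents of pyth/__init__.py (illustrative only).
sample_init = '"""pyth package."""\n__version__ = "0.6.0"\n'
version = read_version(sample_init)
```

setup.py could then pass the returned value as `version=` so the package metadata and `pyth.__version__` cannot disagree.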

rtf control word "\f0" not recognized

I'm using rtf files generated by pandoc. They have a lot of "\f0" control words (no idea why).

/plugins/rtf15/reader.py cannot read these files because of this "\f0" word.

For a general solution, could you skip unknown control words?

Example rtf:
{\rtf\ansi\deff0{\fonttbl{\f0\froman Tms Rmn;}{\f1\fdecor
Symbol;}{\f2\fswiss Helv;}}{\colortbl;\red0\green0\blue0;
\red0\green0\blue255;\red0\green255\blue255;\red0\green255
blue0;\red255\green0\blue255;\red255\green0\blue0;\red255
green255\blue0;\red255\green255\blue255;}{\stylesheet{\fs20
\snext0Normal;}}{\info{\author John Doe}
{\creatim\yr1990\mo7\dy30\hr10\min48}{\version1}{\edmins0}
{\nofpages1}{\nofwords0}{\nofchars0}{\vern8351}}\widoctrl\ftnbj \sectd\linex0\endnhere \pard\plain \fs20 This is plain text.\
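
The "skip unknown control words" idea can be sketched with a toy scanner. This is not pyth's reader: `extract_text` and the `handlers` table are hypothetical, and the sketch ignores destinations such as \fonttbl, which a real reader would have to skip as whole groups.

```python
import re

# A control word: backslash, letters, optional signed number, optional one-space delimiter.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)?[ ]?", re.IGNORECASE)


def extract_text(rtf, handlers):
    """Collect plain text, dispatching known control words and skipping the rest."""
    out = []
    pos = 0
    while pos < len(rtf):
        match = CONTROL_WORD.match(rtf, pos)
        if match:
            handler = handlers.get(match.group(1))
            if handler is not None:
                out.append(handler(match.group(2)))
            # Unknown control words (such as \f0 here) are silently skipped.
            pos = match.end()
        elif rtf[pos] in "{}":
            pos += 1  # group delimiters carry no text in this toy version
        else:
            out.append(rtf[pos])
            pos += 1
    return "".join(out)


handlers = {"par": lambda digits: "\n"}
text = extract_text(r"{\rtf1\ansi\f0\fs20 This is \f0 plain text.\par}", handlers)
```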

problem with reading file.rtf with table inside

Hello!
Could you please help.

I have file.rtf with table inside. I need to read text from table.
But the program does not insert any separator when moving from one cell to another within a row.

Example:
Input: Cell 1 : How / Cell 2 : are / Cell 3 : you
Output: Howareyou

My code is:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter

doc = Rtf15Reader.read(open('test.rtf', 'rb'))
PlaintextWriter.write(doc).getvalue()

Thank you!

Parse colortbl and colored text

Unicode error when reading RTF

When trying to read https://www.gnu.org/licenses/lgpl.rtf I get:

>>> b=Rtf15Reader.read(open('lgpl.rtf', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 86, in read
    return reader.go()
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 109, in go
    self.parse()
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 143, in parse
    self.group.handle(control, digits)
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 402, in handle
    handler(digits)
  File "/home/pom/tmp/local/lib/python2.7/site-packages/pyth/plugins/rtf15/reader.py", line 521, in handle_ansi_escape
    char = chr(code).decode(self.charset, self.reader.errors)
UnicodeDecodeError: 'cp932' codec can't decode byte 0x81 in position 0: incomplete multibyte sequence
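
The error can be reproduced without pyth. In cp932 (Shift-JIS), 0x81 is a lead byte that is only meaningful together with a following byte, so decoding it one byte at a time fails. A small Python 3 demonstration:

```python
lead_byte = bytes([0x81])

try:
    lead_byte.decode("cp932")
    failed = False
except UnicodeDecodeError:
    # Same failure mode as in the traceback: incomplete multibyte sequence.
    failed = True

# Together with its trail byte, the pair decodes to one character
# (0x81 0x40 is the ideographic space U+3000).
full_char = bytes([0x81, 0x40]).decode("cp932")
```

This suggests the reader has to collect both bytes of a double-byte character before decoding, or the wrong code page is being selected for the file.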

Parsing RTF fails when an escaped quote is followed by non-hex digits

Parsing a string like \'hello causes an exception because the RTF reader expects the following two characters to be hex digits.
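
A tolerant way to handle this is to check for two hex digits before converting. The helper below is a hypothetical sketch, not pyth's code; the default `cp1252` charset is an assumption made for illustration.

```python
import re

# \' followed by exactly two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def read_hex_escape(text, pos, charset="cp1252"):
    """Return (char, new_pos) for a valid escape at pos, or (None, pos) otherwise."""
    match = HEX_ESCAPE.match(text, pos)
    if match is None:
        return None, pos  # the caller decides how to treat the malformed escape
    char = bytes([int(match.group(1), 16)]).decode(charset, "replace")
    return char, match.end()


valid = read_hex_escape(r"\'e9t", 0)
invalid = read_hex_escape(r"\'hello", 0)
```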

rtf to xhtml conversion ignores "}"

My rtf file has "}" symbols (and "{"), but the generated XHTML no longer contains them.

I will send you a test file.

rtf reader: non-ASCII metadata causes UnicodeDecodeError (openoffice rtf files)

I have openoffice rtf files with non-ASCII metadata (author):

{\info{\author Claudia Jürgens}{\creatim\yr2010\mo7\dy19\hr12\min45}{\author Claudia Jürgens}
{\revtim\yr2010\mo7\dy28\hr13\min27}{\printim\yr0\mo0\dy0\hr0\min0}{\comment    
StarWriter}{\vern3000}}\deftab709

This causes UnicodeDecodeError:

Module pyth.plugins.rtf15.reader, line 93, in read
Module pyth.plugins.rtf15.reader, line 113, in go                                           
Module pyth.plugins.rtf15.reader, line 147, in parse
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

This patch just catches the error:

*** reader.py  2010-05-04 21:48:14.000000000 +0200 
--- reader.py   2010-08-04 21:47:10.000000000 +0200 
***************       
*** 140,146 ****      
                  control, digits = self.getControl() 
                  self.group.handle(control, digits) 
              else:   
!                 self.group.char(unicode(next)) 


      def getControl(self): 
--- 140,149 ----      
                  control, digits = self.getControl() 
                  self.group.handle(control, digits) 
              else:   
!                 try: 
!                     self.group.char(unicode(next)) 
!                 except UnicodeDecodeError, e: 
!                     self.group.char('?') 


      def getControl(self):

wiki.github.com/brendonh/pyth is a broken link

In setup.py, the home page of this project is set to https://wiki.github.com/brendonh/pyth . This link is broken.

This shows up as a link in the page https://pypi.org/project/pyth/ , clicking on "Homepage" will lead to a 404 error.

CJK characters support for RTF parse

Hello guys,

CJK means Chinese, Japanese, and Korean. Many older RTF writers do not store these characters as Unicode, and using pyth to read CJK characters from such documents raises a UnicodeDecodeError, because characters in CJK code pages take 4 hex digits (two bytes), not 2.

I modified plugins/rtf15/reader.py to meet my own needs, but I still hope someone can write better code to deal with this issue.

1)Add this first:

from binascii import unhexlify

2)Add number 936:

# All the ones named by number in my 2.6 encodings dir
_CODEPAGES_BY_NUMBER = dict(
    (x, "cp%s" % x) for x in (37, 1006, 1026, 1140, 1250, 1251, 1252, 1253, 1254, 1255,
                              1256, 1257, 1258, 424, 437, 500, 737, 775, 850, 852, 855,
                              856, 857, 860, 861, 862, 863, 864, 865, 866, 869, 874,
                              875, 932, 936, 949, 950))

3)Change to 'ignore' :

def read(self, source, errors='ignore'):

4):

            if next == "'":
                # ANSI escape, takes two hex digits
                chars.extend("ansi_escape")
                digits.extend(self.source.read(2))

                #For some asian languages, takes two more digits

                #Japanese:
                if self.charset == "cp932":
                   if self.source.read(2) == "\\'":
                      digits.extend(self.source.read(2))    
                #Simplified Chinese:       
                if self.charset == "cp936":
                   if self.source.read(2) == "\\'":
                      digits.extend(self.source.read(2))                          
                #Korean:
                if self.charset == "cp949":
                   if self.source.read(2) == "\\'":
                      digits.extend(self.source.read(2))          
                #Traditional Chinese:
                if self.charset == "cp950":
                   if self.source.read(2) == "\\'":
                      digits.extend(self.source.read(2))

                break

    def handle_ansi_escape(self, code):
        cjk = code
        code = int(code, 16)

        if isinstance(self.charset, dict):
            uni_code = self.charset.get(code)
            if uni_code is None:
                char = u'?'
            else:
                char = unichr(uni_code)

        else:
            if code <= 255:
               char = chr(code).decode(self.charset, self.reader.errors)
               self.content.append(char)
            else:
               char = unhexlify(cjk).decode(self.charset, self.reader.errors)
               self.content.append(char)
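
The same idea can be expressed independently of the code page by collecting each run of consecutive \'xx escapes into one byte string and decoding the run as a whole. This is a Python 3 sketch with a hypothetical helper name, not a patch against pyth; it works on a text string rather than on the reader's stream.

```python
import re
from binascii import unhexlify

# One or more consecutive \'xx escapes.
ESCAPE_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_escape_runs(text, charset, errors="replace"):
    """Decode each run of \\'xx escapes as a single byte sequence."""

    def replace(match):
        hex_digits = match.group(0).replace("\\'", "")
        return unhexlify(hex_digits).decode(charset, errors)

    return ESCAPE_RUN.sub(replace, text)


japanese = decode_escape_runs(r"\'82\'a0", "cp932")  # hiragana "a"
chinese = decode_escape_runs(r"\'d6\'d0", "cp936")   # "middle"
```

Decoding whole runs avoids special-casing each CJK code page, and single-byte escapes in Western code pages still decode correctly.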

newline for plaintext writer

It seems that the newline parameter is not used. I can make a pull request, but this is my first time using this tool, so please confirm that I should just replace "\n" with self.newline.

There also seems to be a bug in the rtf-to-text conversion involving the newline from a paragraph and the newline from a line break, but I need to investigate to see exactly why.

Python 3 support

It would be nice if pyth would support Python 3. Is that support on the roadmap?

Implement lists in RTF reader

It could be nice if the RTF reader was able to parse \listtable and read \levelnfc (type of list) and possibly \levelstartat (list start value) as documented here. It should also implement list overrides, but just to get the proper reference for \ls elements.

I have a lot of tiny RTF documents, some including bulleted/numbered lists, and I would love to be able to preserve this information and output it as XHTML.

rtf reader: decode argument TypeError

I will send you a test file.

Module pyth.plugins.rtf15.reader, line 103, in read
Module pyth.plugins.rtf15.reader, line 124, in go
Module pyth.plugins.rtf15.reader, line 155, in parse 
Module pyth.plugins.rtf15.reader, line 385, in char
TypeError: decode() argument 1 must be string, not dict

rtf reading example broken?

Hi, I've just cloned repo and tried to run rtf reading example. Error:

Traceback (most recent call last):
  File "D:\dev\soft\pyth\rtf15.py", line 11, in <module>
    doc = Rtf15Reader.read(open(filename))
  File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 103, in read
    return reader.go()
  File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 124, in go
    self.parse()
  File "D:\dev\soft\pyth\pyth\plugins\rtf15\reader.py", line 143, in parse
    self.group = self.stack[-1]
IndexError: list index out of range

Import field text from RTFs

The code currently passes over all non-photo fields, including textboxes with text in them, drop-down menus with items selected, and check blocks that are checked.

I tried to figure out how to handle this within the code's current structure, by incorporating something into the handle_field method, but I couldn't work it out. So instead I made a pre-processing function that finds those types of fields and replaces each one with whatever text was supposed to be there: the entered text for a textbox, the selected text for a drop-down list (or the default if that's appropriate), and "Yes" or "No" for a checkbox. Then, when you run the result through the converter, it comes out as plain text.

This required regex rather than re.

I'm not submitting this as a pull request because I'm not sure where you'd want to include this sort of pre-processing. But if you want to do so, here is the function:



import regex

def flattenrtffields(rawrtf):
    """Replace form fields in raw RTF with the text they display."""

    # Get all "fields", including nested ones
    fieldsearch = regex.compile(r"{\\field[^{]*?({(?>[^{}]+|(?1))*})({(?>[^{}]+|(?1))*})}")

    textboxes, drops, checks = [], [], []
    checkboxoptions = ["No", "Yes"]

    # Make lists of the kinds of fields to flatten
    for field in fieldsearch.finditer(rawrtf):
        if "FORMTEXT" in field[0]:
            textboxes.append(field[0])
        elif "FORMDROPDOWN" in field[0]:
            drops.append(field[0])
        elif "FORMCHECKBOX" in field[0]:
            checks.append(field[0])

    # Deal with textboxes: keep the entered text
    for textbox in textboxes:
        try:
            result = regex.search(r"fldrslt ({(?>[^{}]+|(?1))*})}", textbox)[1]
            if result:
                rawrtf = rawrtf.replace(textbox, result)
        except Exception:
            # Leave the field untouched if it does not match the expected shape
            pass

    # Deal with drop-down lists: keep the selected item (25 means "use the default")
    for drop in drops:
        try:
            ddresult = regex.search(r"fftype2.*ffres([0-9]*)", drop)[1]
            if ddresult == "25":
                ddresult = regex.search(r"ffdefres([0-9]*)", drop)[1]
            # Use regex here as well; the original called re.findall without importing re
            ddlist = regex.findall(r"ffl ([^}]*)}", drop)
            rawrtf = rawrtf.replace(drop, "{\\rtlch " + ddlist[int(ddresult)] + "}")
        except Exception:
            pass

    # Deal with checkboxes: write "Yes" or "No"
    for check in checks:
        try:
            result = regex.search(r"fftype1.*ffres([0-9]*)", check)[1]
            if result == "25":
                result = regex.search(r"ffdefres([0-9]*)", check)[1]
            # Index with this checkbox's own result (the original reused ddresult by mistake)
            rawrtf = rawrtf.replace(check, "{\\rtlch " + checkboxoptions[int(result)] + "}")
        except Exception:
            pass

    return rawrtf

Feature request: Add support for reading simple RTF tables

Currently there is no support for reading a table from RTF. This should be added to meet the design goals.

rtf reader: unichr() causes ValueError ()

I have an rtf file with strange unicode strings (I will send you an email).

This causes rtf reader to throw ValueError:

* Module pyth.plugins.rtf15.reader, line 93, in read
* Module pyth.plugins.rtf15.reader, line 113, in go
* Module pyth.plugins.rtf15.reader, line 141, in parse
* Module pyth.plugins.rtf15.reader, line 369, in handle
* Module pyth.plugins.rtf15.reader, line 476, in handle_u
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

The reason is that my Python was built without support for "wide" Unicode characters (http://www.python.org/dev/peps/pep-0261/). Still, some exception handling would be nice.
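
On a narrow build, a code point above 0xFFFF has to be represented as a UTF-16 surrogate pair rather than passed to unichr() directly. The arithmetic is sketched below in Python 3 with a hypothetical helper; how the reader would use the resulting code units is left open.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for a code point (one unit, or a surrogate pair)."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


units = to_utf16_units(0x1F600)
```

On a narrow Python 2 build, each returned unit is within the range that unichr() accepts.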

Request - Support for parsing string instead of file

Looking for the capability to parse string content instead of reading from a file, such as an overload of the Rtf15Reader.read() function that accepts a string as input.

My code is reading the content from a database table and one of the columns contains RTF content which needs to be parsed as plain text.

For example: Column content
{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}{\colortbl ;\red0\green0\blue0;}{\*\generator Riched20 15.0.4811}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\f0\fs20\lang2057 hi\f1\lang1033\par{\*\lyncflags<rtf=1>}}

Should be parsed as:

hi

Is there a way to achieve it using this utility?
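
A likely workaround, assuming the reader only needs a file-like object with a read() method (as the open(..., 'rb') calls elsewhere on this page suggest), is to wrap the string in an in-memory stream. The pyth call is left as a comment because it is an untested assumption; the rest runs on its own.

```python
from io import BytesIO

# Stand-in for the RTF text fetched from the database column.
rtf_text = r"{\rtf1\ansi\deff0 hi\par}"

# Wrap the content in an in-memory binary stream that behaves like an open file.
source = BytesIO(rtf_text.encode("ascii"))

# Assumption: the reader accepts any file-like object, e.g.
# doc = Rtf15Reader.read(source)

header = source.read(5)  # peek at the first bytes to show it reads like a file
source.seek(0)           # rewind before handing the stream to a reader
```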

Exclude images when writing plain text files

Converting an RTF with images to a plain text file writes the binary data of the image to the plain text file. When I create a plain text file, I expect that the images will be stripped.

Bump to 0.6.1?

Any chance you can cut a release and bump the PyPI version? Thanks!

Hex (encoded images) in `\pict` control groups is not removed.

When I read an RTF and write it out as plain text (both with pyth), all of the hex for embedded images is included in the document. As expected, the \pict control group itself is gone.

At the moment, I'm preprocessing these files to wipe out the pict group (hex included) before using pyth, but, of course, it would be nice to avoid that. I'm not familiar enough with RTF versions to know if this is part of the 1.5 spec or a later one. However, these files run perfectly otherwise.

I can send you an example, if needed.
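
The preprocessing workaround described above can be sketched as a brace-matching pass that drops every group starting with \pict. `strip_pict_groups` is a hypothetical helper, not part of pyth, and it does not handle raw binary data introduced by \bin.

```python
def strip_pict_groups(rtf):
    """Remove every {\\pict ...} group, including nested groups and hex data."""
    out = []
    pos = 0
    while pos < len(rtf):
        if rtf.startswith("{\\pict", pos):
            depth = 0
            while pos < len(rtf):
                char = rtf[pos]
                if char == "\\" and pos + 1 < len(rtf):
                    pos += 2  # skip the backslash and the character it introduces
                    continue
                if char == "{":
                    depth += 1
                elif char == "}":
                    depth -= 1
                    if depth == 0:
                        pos += 1  # consume the brace that closes the pict group
                        break
                pos += 1
        else:
            out.append(rtf[pos])
            pos += 1
    return "".join(out)


cleaned = strip_pict_groups("{\\rtf1 before {\\pict\\pngblip 89504e47}after}")
```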
