Giter Club home page Giter Club logo

xmlutils.py's Introduction

NOTICE (May 2017): I'm no longer able to maintain this project or look at posted issues. If you're interested in maintaining, please let me know.

xmlutils.py

xmlutils.py is a set of Python utilities for processing xml files serially
for converting them to various formats (SQL, CSV, JSON). The scripts use ElementTree.iterparse() to iterate through nodes in an XML document, thus not needing to load the entire DOM into memory. The scripts can be used to churn through large XML files (albeit taking long :P) without memory hiccups.

Simple table-representing XMLs can be converted to CSV using xmltable2csv. It assumes each entry is encapsulated in some tag, and successfuly tested on some XLSX files.

Blind conversion of XML to CSV and SQL is not recommended. It only works if the structure of the XML document is simple (flat). On the other hand, xml2json supports complex XML documents with multiple nested hierarchies. Lastly, the XML files are not validated at the time of conversion.

Installation

With pip or easy_install

pip install xmlutils or easy_install xmlutils

Or from the source

python setup.py install

Commandline utilities

Once the package is installed, the three bundled commandline utilities should be available from the terminal.

xml2csv

Convert an XML document to a CSV file.

xml2csv --input "samples/fruits.xml" --output "samples/fruits.csv" --tag "item"
Arguments
--input 	Input XML document's filename*
--output 	Output CSV file's filename*
--tag 		The tag of the node that represents a single record (Eg: item, record)*
--delimiter 	Delimiter for seperating items in a row. Default is , (a comma followed by a space)
--ignore 	A space separated list of element tags in the XML document to ignore
--noheader 	Exclude CSV fields header (first line). Off by default
--encoding 	Character encoding of the document. Default is utf-8
--limit 	Limit the number of records to be processed from the document to a particular number. Default is no limit (-1)
--buffer 	The number of records to be kept in memory before it is written to the output CSV file. Helps reduce the number of disk writes. Default is 1000

xmltable2csv

Convert an XML table to a CSV file.

xmltable2csv --input "samples/fruits.xml" --output "samples/fruits.csv" --tag "Data"
Arguments
--input         Input XML table's filename*
--output        Output CSV file's filename*
--tag           The tag of the node that represents a single record (Eg: Data, record)*
--delimiter     Delimiter for seperating items in a row. Default is , (a comma followed by a space)
--header        Whether to print the header (first row of records in the XML) in the first line; 1=yes, 0=no. Default is 1.
--encoding      Character encoding of the document. Default is utf-8
--limit         Limit the number of records to be processed from the document to a particular number. Default is no limit (-1)
--buffer        The number of records to be kept in memory before it is written to the output CSV file. Helps reduce the number of disk writes. Default is 1000.

xml2sql

Convert an XML document to an SQL file.

xml2sql --input "samples/fruits.xml" --output "samples/fruits.sql" --tag "item" --table "myfruits"
Arguments
--tag           the record tag. eg: item
--table         table name
--ignore        list of tags to ignore
--limit         maximum number of records to process
--packet        maximum size of an insert query in MB (MySQL's max_allowed_packet)

xml2json

Convert XML to JSON. xml2json supports hierarchies nested to any number of levels.

xml2json --input "samples/fruits.xml" --output "samples/fruits.json"

Modules

xmlutils.xml2sql

from xmlutils.xml2sql import xml2sql

converter = xml2sql("samples/fruits.xml", "samples/fruits.sql", encoding="utf-8")
converter.convert(tag="item", table="table")
Arguments
tag 	-- the record tag. eg: item
table	-- table name
ignore	-- list of tags to ignore
limit	-- maximum number of records to process
packet	-- maximum size of an insert query in MB (MySQL's max_allowed_packet)

Returns:
{	num: number of records converted,
	num_insert: number of sql insert statements generated
}

xmlutils.xml2csv

from xmlutils.xml2csv import xml2csv

converter = xml2csv("samples/fruits.xml", "samples/fruits.csv", encoding="utf-8")
converter.convert(tag="item")
Arguments
tag	-- the record tag. eg: item
delimiter -- csv field delimiter
ignore	-- list of tags to ignore
limit	-- maximum number of records to process
buffer	-- number of records to keep in buffer before writing to disk

Returns:
number of records converted

xmlutils.xml2json

from xmlutils.xml2json import xml2json

converter = xml2json("samples/fruits.xml", "samples/fruits.json", encoding="utf-8")
converter.convert()

# to get a json string
converter = xml2json("samples/fruits.xml", encoding="utf-8")
print converter.get_json()
Arguments
pretty	-- pretty print?

xmlutils.py's People

Contributors

billwanjohi avatar dunglehome avatar egalev avatar hellosputnik avatar jadavbheda avatar knadh avatar lorcan avatar onemoretime avatar ropp avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

xmlutils.py's Issues

datapacket XML format

Hi,

I have an xml-file that looks like this

<?xml version="1.0" standalone="yes"?>
<DATAPACKET Version="2.0">
    <METADATA>
        <FIELDS>
            <FIELD attrname="test_attribute" fieldtype="string" WIDTH="6" />
        </FIELDS>
        <PARAMS />
    </METADATA>
    <ROWDATA>
        <ROW test_attribute="value" />
    </ROWDATA>
</DATAPACKET>

Can one adjust your code to apply it to such a file?

Best,

Henning

Skip void elements and specific items values

In my xml, I have some nodes in the chosen tag that have no values, I would like to skip them:

A typical node is like this

<source>Yahoo Business</source><url>http://us.rd.yahoo.com/dailynews/rss/business/*http://story.news.yahoo.com/news?tmpl=story2 u=/nm/20040814/bs_nm/column_stocks_week_dc</url><title>Wall St. Pullback Reflects Tech Blowout (Reuters)</title><image>none</image><category>Business</category><description>Reuters - Wall Street's long-playing drama,\"Waiting for Google," is about to reach its final act, but its\stock market debut is ending up as more of a nostalgia event\than the catalyst for a new era.</description><rank>5</rank><pubdate>0000-00-00 00:00:00</pubdate>

and for some of them, I have that the category is "" or it has another value to skip. When I convert to csv doing like

xml2csv --input data/newsspace200.xml --output data/newsspace200_.csv --delimiter $'\t' --tag 'source'

I will get the rows with "" category columni.n the csv file. Is it possibile to skip then those columns?

command line xml2json for multiple records

Excellent package, Kailash. Thank you for sharing it. I'm trying to convert the massive 40GB wikipedia dumps from XML to JSON to use in another program, and most converters I've tried are erroring out.

When I try xml2json, however, I find that it is only converting a single record, so I'm not sure what I'm doing wrong. So, for example, I run...

xml2json --input "enwiki-latest-pages-articles.xml" --output "wiki.json" &

And it only outputs the first JSON record. How would I iterate through all the items in <page>brackets</page> and append them to a single file, either from the command line, or from within a python script (that I can run overnight)?

They're listed as...

<page>
  <field1></field1>
  <field2></field2>
</page>
<page>
  <field1></field1>
  <field2></field2>
</page>
<page>
  <field1></field1>
  <field2></field2>
</page>

Error with "self.next" when running xml2csv.py module

When I enter the command prompt as modelled in the Documentation ----->

(xmlutils.xml2csv
from xmlutils.xml2csv import xml2csv
converter = xml2csv("samples/fruits.xml", "samples/fruits.csv", encoding="utf-8")
converter.convert(tag="item"))

I get the following message. The details of the error are below. Please advise on a solution.

$ python
Python 3.4.3 |Anaconda 2.2.0 (64-bit)| (default, Mar 6 2015, 12:06:10) [MSC v.1
600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

from xmlutils.xml2csv import xmlcsv
Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name 'xmlcsv'
from xmlutils.xml2csv import xml2csv
converter = xml2csv("samples/fruits.xml", "samples/fruits.csv" , encoding =
"utf-8")
coverter.convert(tag = "item")
Traceback (most recent call last):
File "", line 1, in
NameError: name 'coverter' is not defined
converter.convert(tag = "item")
Traceback (most recent call last):
File "", line 1, in
File "c\Python Folder\xmlutils.py\xmlutils\xml2csv.py",
line 56, in convert
event, root = self.context.next()
AttributeError: '_IterParseIterator' object has no attribute 'next'
converter.convert(tag = "item")
Traceback (most recent call last):
File "", line 1, in
File "c\Python Folder\xmlutils.py\xmlutils\xml2csv.py",
line 56, in convert
# event, root = self.context.next()
AttributeError: '_IterParseIterator' object has no attribute 'next'
converter.convert(tag = "item")
Traceback (most recent call last):
File "", line 1, in
File "c\Python Folder\xmlutils.py\xmlutils\xml2csv.py",
line 56, in convert
# event, root = self.context.next()
AttributeError: '_IterParseIterator' object has no attribute 'next'

Order of elements under a node is not mentioned, which create csv with mis-order entry?

A XML node will have fix number of elements and they can be in order from one node to another node. If during conversion a fixed order is not used then the output csv will have mis-matched values. if XML don't have XSD then well formed XML can have elements in any order among different node. During conversion, we can control the output order of elements and hence the csv will write values under header fields. for example, below example have same elements but order are different but in a output csv value must match with appropriate header field.



<files>

<file name="com.nineleaf.tp-1.apk" source="original">
     <format>Android Package Archive</format>
     <mtime>1407440118</mtime>
      <size>27723910</size>
      <md5>2221c4f14bd4178380df53428bc83cd6</md5>
    <crc32>7dedaf45</crc32>
     <sha1>cbb5ecbf3477ad0def0264bf294fbf22271ca313</sha1>
</file>

<file name="com.striker.lalosfeliz-6.apk" source="original">
        <mtime>1413362341</mtime>
       <size>2705561</size>
       <md5>53033d768f5fbc8ec2913d42fce7c9e2</md5>
        <crc32>e6f36b15</crc32>
       <sha1>ae9c4bc0730f67cba372f09c03da27fc954fa5e6</sha1>
       <format>Android Package Archive</format>
</file>
</files>

Push the latest changes to pypi.org

Hi Kailash, Please Push the latest changes that includes support for python 3.x for xml2csv script, as the package available in pypi.org doesn't handle the attribute error.

Thanks!

README documentation is out of date

The documentation on README keeps getting out of sync with the documentation provided by the utility itself. E.g., Issue #18 highlights a problem with the --header option on xml2csv, Here's the relevant documentation as spat out by xml2csv -h:

--header              print csv header (default=True)

and that contained in the README:

--header    Whether to print the CSV header (list of fields) in the first line; 1=yes, 0=no. Default is 1.

I would argue that the documentation spat out by the command line tool is always going to be correct. Either we should continuously rework the online (README) documentation (the preferred option) or drop it in favour of something along the lines of:

xml2csv

Basic usage:

xml2csv --input INPUT_FILE --output OUTPUT_FILE --tag TAG

More detailed usage documentation can be accessed by running xml2csv --help.

Use of `#` in the header line of generated CSV files

Currently the header line is prefixed with the # character, e.g.,

#id,name,price,quantity

Is the prefix strictly-speaking necessary? The closest I can find in a spec for CSV is the RFC 4180, in particular Section 2.3, which might suggest a header line of:

id,name,price,quantity

It's a bit of a niggle maybe, but would it be reasonable to remove the prefix or alternatively add a boolean argument (defaulting to false) to tell xml2csv not to include it? (I'm happy to submit a pull-request if there is general support for either of these suggestions).

xml2csv doesn't work in a loop

I have a script that loops through different high school sports and grabs an xml schedule for each one.

https://gist.github.com/dangayle/6127265

If I use the function to grab a single sport, it works perfectly. If I loop through a simple dict of sports, it creates a new csv file and adds all of the sports data from the preceding loops onto the csv. (You'll want to run the code to see what I mean).

I don't have any idea why it's not working, since Python's garbage collector should be destroying the data from the prior loops.

Thoughts?

Problem specifying tab delimiter

Hi,

I tried using
--delimiter "\t"
for specifying TAB as the delimiter but that didnt work. Tried escaping also
"\\t"
but it didn't work. Resulted CSV prints \t as text and TAB character is not created.

Cannot convert XML to CSV

Tried converting simple XML (below) but get nothing in the output file. Output file is created though. I am using Python v2.7

<?xml version="1.0" encoding="utf-8"?>
<rows>
  <row Id="2" PostId="35314"/> 
</rows>

Tried command line tool which didn't work
xml2csv --input "subset.xml" --output "subset.csv" --tag "row"

Error

image

Tried it through the code which didn't work either

from xmlutils.xml2csv import xml2csv
converter = xml2csv("subset.xml", "subset.csv", encoding="utf-8")
converter.convert(tag="row")

SyntaxError: invalid syntax (related to literal string)

When I run xmltable2csv I get the following error message"

File "/usr/lib/python3.4/site-packages/xmlutils/console.py", line 20
"""
^
SyntaxError: invalid syntax

The offending code is the triple-double quoted string literal as follows

def run_xml2sql():
print """xml2sql by Kailash Nadh (http://nadh.in)
--help for help

"""

What am I doing wrong here? My python is version 3.4, as you can see. I feel I must be making some basic compatibility mistake or some such, as this can not possibly be a real syntax errror, can it?

Incorrect handling of --limit 0

It would be helpful to support --limit 0. This would allow the user to request zero records output, and consequently the user could use this option to get the column headers, and only the column headers.

Currently --limit 0 behaves identically to --limit -1 (the default) which outputs all records. This behavior (outputting all records when limit is zero) is actually a by-product of another minor bug, which is the following test:

if n == limit:
    break

This test happens after n has been incremented and consequently will never break when limit is zero.

Enable Multi-Tag Convert for CSV

This is a very useful tool for an XML file containing homogenous records. However, for XML files that contain various types of records, parsing all the record types requires scanning the file multiple times. For small XML files this is trivial but for larger GB sized XML files this is very slow.

Leveraging what you already have built, could you add a convert_multiple() method that would accept a list of tags and use the output path as the base path with the tag name appended. Something like this:

converter = xml2csv('some/path/my_xml.xml', 'some/path/my_csv_', encoding='utf8')
converter.convert_multiple(tags=('tag1', 'tag2', 'tag3'))

Resulting in output files like this:

my_csv_tag1.csv
my_csv_tag2.csv
my_csv_tag3.csv

checking for start event results in unpopulated element text (xml2csv)

line 59 of xml2csv:

if started and event == 'start' ....

checks for 'start' events. this does not guarantee that the element is actually read in its entirety, and can result in None type elem.text, which causes blank fields in the ouptut.

see,
http://infix.se/2009/05/10/text-safe-xml-processing-with-iterparse

this happened in practice for me. I believe building on 'end' events fixes this problem, i.e.:

if started and event == 'end' ....

with everything else the same.

Needs a 'tag' scanner for xml files

Rather than assuming the xml tag structure, the script needs to essentially scan the xml file BEFORE outputting the contents into CSV so that the column headers can be created first. This will ensure that all tags within the CSV file are accounted for. I have some files you can use for testing this if needs be.

Missing element causes cell data to appear in the wrong column

If an element is missing in the source XML, then the corresponding cell is erroneously omitted from the resulting CSV. This causes values in the resulting CSV to appear in the wrong columns.

This command:

xml2csv --input in.xml --output out.csv --tag "row"

On this input file, in.xml:

<?xml version="1.0"?>
<top>
  <row>
    <a>r1_col_a</a>
    <b>r1_col_b</b>
  </row>
  <row>
    <b>r2_col_b</b>
  </row>
</top>

Produced this output file, out.csv:

a,b
"r1_col_a","r1_col_b"
"r2_col_b"

Notice that r2_col_b is in column a, when it should be in column b.

xml2csv generates comma plus space delimited files rather than comma-spaced

The default separator set for xml2csv in console.py is ", " (a comma and a space). Perhaps it should be "," (just a comma)?
In this way we get files with lines like:

#id,name,price,quantity
"1000","Apple","4","133"

rather than

#id, name, price, quantity
"1000", "Apple", "4", "133"

by default. I'm submitting a pull request now, if this is worth changing.
Great work on xmlutils :-)

How to extract node attributes as field in resulting csv or json?

For example if a XML file have below node,
<file name="com.nineleaf.tp-1.apk" source="original"> <format>Android Package Archive</format> <mtime>1407440118</mtime> <size>27723910</size> <md5>2221c4f14bd4178380df53428bc83cd6</md5> <crc32>7dedaf45</crc32> <sha1>cbb5ecbf3477ad0def0264bf294fbf22271ca313</sha1> </file>

Now, the resulting csv should have file-name and file-source (node name + attribute name) as a field and these values should go under these fields.

So, resulting csv header should be like.

<file-name, file-source, format, mtime, size, md5, crc32, sha1>

won't be able to execute xml2json even after successfully setting up the environment

I built the environment successfully, and trying to run the command from directory xmlutils.py
xml2json --input "/sample/logs.xml" --output "/sample/logs.json"

getting this error:

Traceback (most recent call last): File "/home/xalteddev/anaconda3/bin/xml2json", line 33, in <module> sys.exit(load_entry_point('xmlutils==1.4.1', 'console_scripts', 'xml2json')()) File "/home/xalteddev/anaconda3/lib/python3.8/site-packages/xmlutils/console.py", line 109, in run_xml2json parser.add_argument('--input', type=file, dest='input_file', required=True, help='input xml filename') NameError: name 'file' is not defined

can't convert from string

i want to converter xml on the fly like

converter = xml2json(message, encoding="utf-8")

I wanted to convert xml from a variable rather than file as explained in the examples.

Documentation for --header option is incorrect

I wanted a header. Documentation said "yes" was default, but I used --header to make sure and got no header. Then tried --header 1 and that was an error. Finally ran it without the --header and got the result I was looking for. I don't think this result matches the result I was expecting based on the documentation.

My suggestion: two options, --header
--no-header

with the obvious effect applied...

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.