jezhiggins / arabica
Arabica is an XML and HTML processing toolkit, providing SAX2, DOM, XPath, and XSLT implementations, written in Standard C++
License: Other
➜ arabica git:(main) autoreconf
configure.ac:8: error: required file './compile' not found
configure.ac:8: 'automake --add-missing' can install 'compile'
configure.ac:8: error: required file './ltmain.sh' not found
configure.ac:3: error: required file './missing' not found
configure.ac:3: 'automake --add-missing' can install 'missing'
parallel-tests: error: required file './test-driver' not found
parallel-tests: 'automake --add-missing' can install 'test-driver'
autoreconf: automake failed with exit status: 1
➜ arabica git:(main) ✗ automake --add-missing
configure.ac:8: installing './compile'
configure.ac:8: error: required file './ltmain.sh' not found
configure.ac:3: installing './missing'
parallel-tests: installing './test-driver'
➜ arabica git:(main) ✗ autoreconf
configure.ac:8: error: required file './ltmain.sh' not found
autoreconf: automake failed with exit status: 1
➜ arabica git:(main) ✗ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.10
Release: 20.10
Codename: groovy
While the XPath tests run successfully on the November 2012 release, they are broken in the current revision:
make[1]: Entering directory '/home/sunkiss/arabica-master/tests/XPath'
make[1]: 'xpath_test' is up to date.
make[1]: 'xpath_test_silly' is up to date.
make[1]: 'xpath_test_wide' is up to date.
make  check-TESTS
../../test-driver: line 95:  8497 Segmentation fault      "$@" > $log_file 2>&1
FAIL: xpath_test
../../test-driver: line 95:  8516 Segmentation fault      "$@" > $log_file 2>&1
FAIL: xpath_test_silly
../../test-driver: line 95:  8535 Segmentation fault      "$@" > $log_file 2>&1
FAIL: xpath_test_wide
make[2]: *** [test-suite.log] Error 1
make[2]: Leaving directory '/home/sunkiss/arabica-master/tests/XPath'
make[1]: *** [check-TESTS] Error 2
make[1]: Leaving directory '/home/sunkiss/arabica-master/tests/XPath'
make: *** [check-am] Error 2
All of the generated .log files are empty.
As part of an internal C++ XML parsing benchmark, I'm using Arabica as a candidate alongside Xerces, Expat, and libxml2. We still use an old version of Xerces (2.7.0), so I need to compare against that.
I found that Arabica doesn't build with Xerces library versions such as 2.7.0 or 2.8.0, mainly because of missing XMLFilePos and XMLSize_t definitions, which are handled automatically in Xerces versions 3.0.0 and later.
Adding the following code to saxxerces.hpp seems to fix it, but it may be a little simplistic and I haven't tested it extensively.
in (...)/include/SAX/wrappers/saxxerces.hpp:
...
namespace Arabica
{
  namespace SAX
  {
#ifndef XERCES_AUTOCONF
    typedef unsigned int XMLFilePos;
    typedef unsigned int XMLSize_t;
#endif
...
Would you be open to patching modern versions of Arabica so they build with older Xerces library versions?
Thanks
I think it would be nice to add support for it. It is fast, tiny, stable, and free. It supports XPath 1.0, but lacks XSLT.
Some benchmark comparisons would be nice too :)
Arabica is not thread-safe, even when used with objects that aren't shared. The problem lies in the Boost Spirit Classic API, which can't be used safely in multithreaded environments (see https://svn.boost.org/trac/boost/ticket/5520). The problem can easily be reproduced by creating several threads which either compile XPath expressions or parse XML in distinct objects. It seems that only switching to Boost Spirit V2 would solve the problem.
I've built Arabica on Windows using Visual C++ 2013 (64-bit).
Then I downloaded the OpenStreetMap data extract for Antarctica (63MB bz2, 897MB XML) and issued:
transcode.exe -i antarctica-latest.osm -o antartica.xml -ie utf8 -oe utf16le
It took 84 seconds, but the memory peak was 4GB.
Any other continent (Central America, for example: 368MB bz2, ~5GB XML file) doesn't get converted at all.
By looking at transcode's conversion function, one would think that the streams were flushed after every 1024 characters:
void wchar_transcode()
{
#ifndef ARABICA_NO_WCHAR_T
  int count = 0;
  wchar_t c = iCharAdaptor.get();
  while(!iCharAdaptor.eof())
  {
    oCharAdaptor << c;
    ++count;
    if(count == 1024)
    {
      oCharAdaptor.flush();
      oByteConvertor.flush();
      count = 0;
    } // if ...
    c = iCharAdaptor.get();
  }
  oCharAdaptor.flush();
  oByteConvertor.flush();
#endif
} // wchar_transcode
But obviously that's not the case.
I've added a few cerr messages in <io/convert_adaptor.hpp>. The output for Antarctica looks like this:
inbuffer size: 1028
inbuffer size: 1028
outbuffer size: 1024
outbuffer size: 2048
outbuffer size: 4096
outbuffer size: 8192
inbuffer size: 2060
outbuffer size: 16384
outbuffer size: 32768
outbuffer size: 65536
outbuffer size: 131072
outbuffer size: 262144
outbuffer size: 524288
outbuffer size: 1048576
outbuffer size: 2097152
outbuffer size: 4194304
outbuffer size: 8388608
outbuffer size: 16777216
outbuffer size: 33554432
outbuffer size: 67108864
outbuffer size: 134217728
outbuffer size: 268435456
outbuffer size: 536870912
outbuffer size: 1073741824
inbuffer size: 4124
flushOut
outbuffer size: 1024
outbuffer size: 2048
outbuffer size: 4096
outbuffer size: 8192
outbuffer size: 16384
outbuffer size: 32768
outbuffer size: 65536
outbuffer size: 131072
outbuffer size: 262144
outbuffer size: 524288
outbuffer size: 1048576
outbuffer size: 2097152
outbuffer size: 4194304
outbuffer size: 8388608
outbuffer size: 16777216
outbuffer size: 33554432
outbuffer size: 67108864
outbuffer size: 134217728
outbuffer size: 268435456
outbuffer size: 536870912
outbuffer size: 1073741824
outbuffer size: 2147483648
flushOut
Transcoding took: 84 seconds
flushOut
flushOut
As you can see, flushOut is not called after every oCharAdaptor.flush().
I think transcode is a cool utility, but the fact that it can't handle big files is a showstopper.
Hi Jez,
It seems that the HTML parser doesn't handle escaped Chinese characters correctly. For example, when it finds the numeric character reference &#36825; (这),
it casts the value 36825 to a char, so it is truncated to 217.
Is there any workaround to avoid this issue?
Thanks in advance and thank you very much for this parser ;-)
The libxml2 wrapper will parse CDATA sections as normal text, because lwit_characters is registered as the CDATA handler.
We package this software in Homebrew, and could use a new tagged release given the clang (C++11) fixes that have gone in since 2012.