jezhiggins / arabica Goto Github PK

Arabica is an XML and HTML processing toolkit, providing SAX2, DOM, XPath, and XSLT implementations, written in Standard C++

License: Other

Shell 0.29% XSLT 37.14% C++ 47.12% Python 0.01% ASP 0.01% Assembly 0.01% CMake 0.18% HTML 2.41% PHP 0.03% Batchfile 12.14% Makefile 0.18% M4 0.46% Pascal 0.01% NASL 0.02%

arabica's People

Contributors

Stargazers

Watchers

arabica's Issues

Ubuntu configure errors

➜  arabica git:(main) autoreconf
configure.ac:8: error: required file './compile' not found
configure.ac:8:   'automake --add-missing' can install 'compile'
configure.ac:8: error: required file './ltmain.sh' not found
configure.ac:3: error: required file './missing' not found
configure.ac:3:   'automake --add-missing' can install 'missing'
parallel-tests: error: required file './test-driver' not found
parallel-tests:   'automake --add-missing' can install 'test-driver'
autoreconf: automake failed with exit status: 1
➜  arabica git:(main) ✗ automake --add-missing
configure.ac:8: installing './compile'
configure.ac:8: error: required file './ltmain.sh' not found
configure.ac:3: installing './missing'
parallel-tests: installing './test-driver'
➜  arabica git:(main) ✗ autoreconf
configure.ac:8: error: required file './ltmain.sh' not found
autoreconf: automake failed with exit status: 1
➜  arabica git:(main) ✗ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.10
Release:        20.10
Codename:       groovy

XPath tests fail on latest revision [parser=libxml2]

While XPath tests run successfully on the 2012 November release, they are broken in the current revison:

[sunkiss@sunfire XPath]$ make check
make xpath_test xpath_test_silly xpath_test_wide
make[1]: Entering directory `/home/sunkiss/arabica-master/tests/XPath' make[1]:`xpath_test' is up to date.
make[1]: `xpath_test_silly' is up to date. make[1]:`xpath_test_wide' is up to date.
make[1]: Leaving directory `/home/sunkiss/arabica-master/tests/XPath' make check-TESTS make[1]: Entering directory`/home/sunkiss/arabica-master/tests/XPath'
make[2]: Entering directory `/home/sunkiss/arabica-master/tests/XPath' ../../test-driver: line 95: 8497 Segmentation fault "$@" > $log_file 2>&1 FAIL: xpath_test ../../test-driver: line 95: 8516 Segmentation fault "$@" > $log_file 2>&1 FAIL: xpath_test_silly ../../test-driver: line 95: 8535 Segmentation fault "$@" > $log_file 2>&1 FAIL: xpath_test_wide make[3]: Entering directory`/home/sunkiss/arabica-master/tests/XPath'
make[3]: Nothing to be done for `all'. make[3]: Leaving directory`/home/sunkiss/arabica-master/tests/XPath'

Testsuite summary for Arabica 2013-Sometime

TOTAL: 3

PASS: 0

SKIP: 0

XFAIL: 0

FAIL: 3

XPASS: 0

ERROR: 0

============================================================================
See tests/XPath/test-suite.log
Please report to [email protected]

make[2]: *** [test-suite.log] Error 1
make[2]: Leaving directory /home/sunkiss/arabica-master/tests/XPath' make[1]: *** [check-TESTS] Error 2 make[1]: Leaving directory/home/sunkiss/arabica-master/tests/XPath'
make: *** [check-am] Error 2

all the generated .log files are empty

Arabica doesn't build with xerces versions lower than v3.x.x

In the objective to make an intern C++ XML Parsing benchmark, I'm using Arabica as a candidate with xerces expat an libxml2. We actually use an old version of xerces (2.7.0), so i need to compare to this.

I found that Arabica doesn't build with xerces libs versions like 2.7.0 or 2.8.0 mainly because of XMLFilePos and XMLSize_t missing definitions, which is auto-managed in xerces versions greater than 3.0.0.

Adding this part of code in saxxerces.hpp seems to fix it, but maybe it's a little easy and I didn't test a lot.

in (...)/include/SAX/wrappers/saxxerces.hpp:

...
namespace Arabica
{
namespace SAX
{

#ifndef XERCES_AUTOCONF
typedef unsigned int XMLFilePos; 
typedef unsigned int XMLSize_t;
#endif

...

Are you ok to fix modern versions of Arabica to build with old xerces library versions ?

Thanks

Enhacement idea: pugixml backend

I think it would be nice to add support for it. It is fast, tiny, stable and free. It supports XPath1.0, and misses XSLT.

Some benchmark comparisons would be nice too :)

Arabica crashes randomly in multi threaded environments

Arabica is not thread safe even when used with objects that aren't shared. The problem lies in the boost Spirit classic Api which can't be used safely in multi threading environments (see https://svn.boost.org/trac/boost/ticket/5520). The problem can be easily reproduced by creating several threads which either compile Xpath expressions or parse XML in distinct objects. It seems that only switching to Boost Spirit V2 solves the problem.

transcode doesn't handle big xml files

I've build arabica on Windows using Visual C++ 2013 64bit.

Then I've downloaded the OpenstreetMap Data Extracts for Antartica (63MB bz2, 897MB xml) and issued a > transcode.exe -i antarctica-latest.osm -o antartica.xml -ie utf8 -oe utf16le.

It took 84 seconds but the memory peak was 4GB.

Any other continent (Central America for example 368MB bz2, ~5GB xmlfile) doesn't get converted.

By looking at transcode's conversion function one would think that streams were flushed after 1024 characters:

void wchar_transcode()
{
#ifndef ARABICA_NO_WCHAR_T  
  int count = 0;
  wchar_t c = iCharAdaptor.get();
  while(!iCharAdaptor.eof())
  {
    oCharAdaptor << c;

    if(count == 1024)
    {
      oCharAdaptor.flush();
      oByteConvertor.flush();
      count = 0;
    } // if ... 

    c = iCharAdaptor.get();
  }
  oCharAdaptor.flush();
  oByteConvertor.flush(); 
#endif
} // wchar_transcode

But obviously that's not the case.

I've added a few cerr messages in <io/convert_adaptor.hpp>. The output for antartica looks like this:

inbuffer size: 1028
inbuffer size: 1028
outbuffer size: 1024
outbuffer size: 2048
outbuffer size: 4096
outbuffer size: 8192
inbuffer size: 2060
outbuffer size: 16384
outbuffer size: 32768
outbuffer size: 65536
outbuffer size: 131072
outbuffer size: 262144
outbuffer size: 524288
outbuffer size: 1048576
outbuffer size: 2097152
outbuffer size: 4194304
outbuffer size: 8388608
outbuffer size: 16777216
outbuffer size: 33554432
outbuffer size: 67108864
outbuffer size: 134217728
outbuffer size: 268435456
outbuffer size: 536870912
outbuffer size: 1073741824
inbuffer size: 4124
flushOut
outbuffer size: 1024
outbuffer size: 2048
outbuffer size: 4096
outbuffer size: 8192
outbuffer size: 16384
outbuffer size: 32768
outbuffer size: 65536
outbuffer size: 131072
outbuffer size: 262144
outbuffer size: 524288
outbuffer size: 1048576
outbuffer size: 2097152
outbuffer size: 4194304
outbuffer size: 8388608
outbuffer size: 16777216
outbuffer size: 33554432
outbuffer size: 67108864
outbuffer size: 134217728
outbuffer size: 268435456
outbuffer size: 536870912
outbuffer size: 1073741824
outbuffer size: 2147483648
flushOut
Transcoding took: 84 seconds
flushOut
flushOut

As you can see flushOut is not called after every oCharAdaptor.flush();.

I think transcode is a cool utility, but the fact that it can't handle big files is a show stopper.

Arabica doesn't handle escaped chinese characters in the right way

Hi Jez,

It seems that HTML parser doesn't handle escaped chinese characters correctly. For example, when it finds 这 it casts the value 36825 to a char, so it is converted to 217.

Any workaround to avoid this issue?

Thanks in advance and thank you very much for this parser ;-)