arabica's People

Contributors

ashb, benkey, benkeyfsi, dknibbe, eburkitt, jezhiggins, qfiard, sterad, stv0g, wassasin

arabica's Issues

Ubuntu configure errors

➜  arabica git:(main) autoreconf
configure.ac:8: error: required file './compile' not found
configure.ac:8:   'automake --add-missing' can install 'compile'
configure.ac:8: error: required file './ltmain.sh' not found
configure.ac:3: error: required file './missing' not found
configure.ac:3:   'automake --add-missing' can install 'missing'
parallel-tests: error: required file './test-driver' not found
parallel-tests:   'automake --add-missing' can install 'test-driver'
autoreconf: automake failed with exit status: 1
➜  arabica git:(main) ✗ automake --add-missing
configure.ac:8: installing './compile'
configure.ac:8: error: required file './ltmain.sh' not found
configure.ac:3: installing './missing'
parallel-tests: installing './test-driver'
➜  arabica git:(main) ✗ autoreconf
configure.ac:8: error: required file './ltmain.sh' not found
autoreconf: automake failed with exit status: 1
➜  arabica git:(main) ✗ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.10
Release:        20.10
Codename:       groovy

XPath tests fail on latest revision [parser=libxml2]

While the XPath tests run successfully on the November 2012 release, they are broken in the current revision:

[sunkiss@sunfire XPath]$ make check
make xpath_test xpath_test_silly xpath_test_wide
make[1]: Entering directory `/home/sunkiss/arabica-master/tests/XPath'
make[1]: `xpath_test' is up to date.
make[1]: `xpath_test_silly' is up to date.
make[1]: `xpath_test_wide' is up to date.
make[1]: Leaving directory `/home/sunkiss/arabica-master/tests/XPath'
make check-TESTS
make[1]: Entering directory `/home/sunkiss/arabica-master/tests/XPath'
make[2]: Entering directory `/home/sunkiss/arabica-master/tests/XPath'
../../test-driver: line 95: 8497 Segmentation fault "$@" > $log_file 2>&1
FAIL: xpath_test
../../test-driver: line 95: 8516 Segmentation fault "$@" > $log_file 2>&1
FAIL: xpath_test_silly
../../test-driver: line 95: 8535 Segmentation fault "$@" > $log_file 2>&1
FAIL: xpath_test_wide
make[3]: Entering directory `/home/sunkiss/arabica-master/tests/XPath'
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory `/home/sunkiss/arabica-master/tests/XPath'

Testsuite summary for Arabica 2013-Sometime
TOTAL: 3
PASS: 0
SKIP: 0
XFAIL: 0
FAIL: 3
XPASS: 0
ERROR: 0
============================================================================
See tests/XPath/test-suite.log
Please report to [email protected]
make[2]: *** [test-suite.log] Error 1
make[2]: Leaving directory `/home/sunkiss/arabica-master/tests/XPath'
make[1]: *** [check-TESTS] Error 2
make[1]: Leaving directory `/home/sunkiss/arabica-master/tests/XPath'
make: *** [check-am] Error 2

All the generated .log files are empty.

Arabica doesn't build with xerces versions lower than v3.x.x

As part of an internal C++ XML parsing benchmark, I'm evaluating Arabica as a candidate with the Xerces, Expat and libxml2 backends. We currently use an old version of Xerces (2.7.0), so I need to compare against that.

I found that Arabica doesn't build against Xerces library versions such as 2.7.0 or 2.8.0, mainly because the definitions of XMLFilePos and XMLSize_t are missing. Xerces versions from 3.0.0 onwards provide them automatically.

Adding the following code to saxxerces.hpp seems to fix it, but it may be too simple a fix, and I haven't tested it much.

In (...)/include/SAX/wrappers/saxxerces.hpp:

...
namespace Arabica
{
namespace SAX
{

#ifndef XERCES_AUTOCONF
typedef unsigned int XMLFilePos; 
typedef unsigned int XMLSize_t;
#endif

...

Would you be willing to make current versions of Arabica build against old Xerces library versions?

Thanks

Enhancement idea: pugixml backend

I think it would be nice to add support for pugixml. It is fast, tiny, stable and free. It supports XPath 1.0 but lacks XSLT.

Some benchmark comparisons would be nice too :)

Arabica crashes randomly in multi-threaded environments

Arabica is not thread-safe, even when no objects are shared between threads. The problem lies in the Boost Spirit Classic API, which can't be used safely in multi-threaded environments (see https://svn.boost.org/trac/boost/ticket/5520). The problem is easy to reproduce by creating several threads that each either compile XPath expressions or parse XML, using distinct objects. It seems that only switching to Boost Spirit V2 solves the problem.
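Until the parser itself is made thread-safe, one stopgap is to serialize every call that reaches the unsafe code behind a single process-wide mutex. The following is a minimal sketch of that idea only, not a fix for the underlying issue. `compile_expression` is a hypothetical stand-in for whatever call is unsafe (it is not Arabica's API), and `compile_expression_locked` and `run_serialized` are names invented for this illustration.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a call that is not thread-safe internally,
// such as compiling an XPath expression. This is NOT Arabica's API.
inline std::size_t compile_expression(const std::string& expr)
{
  return expr.size();
}

// One process-wide lock shared by every caller.
inline std::mutex& compile_mutex()
{
  static std::mutex m;
  return m;
}

// Wrapper that only lets one thread at a time into the unsafe call.
inline std::size_t compile_expression_locked(const std::string& expr)
{
  std::lock_guard<std::mutex> guard(compile_mutex());
  return compile_expression(expr);
}

// Spawns n_threads workers that each compile one expression under the lock
// and returns how many of them completed.
inline int run_serialized(int n_threads)
{
  int completed = 0;        // protected by done_mutex
  std::mutex done_mutex;
  std::vector<std::thread> workers;
  for (int i = 0; i < n_threads; ++i)
  {
    workers.emplace_back([&completed, &done_mutex] {
      compile_expression_locked("/root/child[@id='1']");
      std::lock_guard<std::mutex> guard(done_mutex);
      ++completed;
    });
  }
  for (std::thread& t : workers)
    t.join();
  return completed;
}
```

The obvious cost is that the guarded work no longer runs in parallel, so this only makes sense where that work is a small part of each thread's job.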

transcode doesn't handle big XML files

I've built Arabica on Windows using Visual C++ 2013 64-bit.

Then I downloaded the OpenStreetMap data extract for Antarctica (63 MB bz2, 897 MB XML) and ran:

transcode.exe -i antarctica-latest.osm -o antartica.xml -ie utf8 -oe utf16le

It took 84 seconds, but memory usage peaked at 4 GB.

Extracts for other continents (Central America, for example: 368 MB bz2, ~5 GB XML file) don't get converted at all.

Looking at transcode's conversion function, one would think that the streams are flushed every 1024 characters:

void wchar_transcode()
{
#ifndef ARABICA_NO_WCHAR_T  
  int count = 0;
  wchar_t c = iCharAdaptor.get();
  while(!iCharAdaptor.eof())
  {
    oCharAdaptor << c;

    if(count == 1024)
    {
      oCharAdaptor.flush();
      oByteConvertor.flush();
      count = 0;
    } // if ... 

    c = iCharAdaptor.get();
  }
  oCharAdaptor.flush();
  oByteConvertor.flush(); 
#endif
} // wchar_transcode

But obviously that's not the case.

I've added a few cerr messages in <io/convert_adaptor.hpp>. The output for Antarctica looks like this:

inbuffer size: 1028
inbuffer size: 1028
outbuffer size: 1024
outbuffer size: 2048
outbuffer size: 4096
outbuffer size: 8192
inbuffer size: 2060
outbuffer size: 16384
outbuffer size: 32768
outbuffer size: 65536
outbuffer size: 131072
outbuffer size: 262144
outbuffer size: 524288
outbuffer size: 1048576
outbuffer size: 2097152
outbuffer size: 4194304
outbuffer size: 8388608
outbuffer size: 16777216
outbuffer size: 33554432
outbuffer size: 67108864
outbuffer size: 134217728
outbuffer size: 268435456
outbuffer size: 536870912
outbuffer size: 1073741824
inbuffer size: 4124
flushOut
outbuffer size: 1024
outbuffer size: 2048
outbuffer size: 4096
outbuffer size: 8192
outbuffer size: 16384
outbuffer size: 32768
outbuffer size: 65536
outbuffer size: 131072
outbuffer size: 262144
outbuffer size: 524288
outbuffer size: 1048576
outbuffer size: 2097152
outbuffer size: 4194304
outbuffer size: 8388608
outbuffer size: 16777216
outbuffer size: 33554432
outbuffer size: 67108864
outbuffer size: 134217728
outbuffer size: 268435456
outbuffer size: 536870912
outbuffer size: 1073741824
outbuffer size: 2147483648
flushOut
Transcoding took: 84 seconds
flushOut
flushOut

As you can see, flushOut is not called after every oCharAdaptor.flush().

I think transcode is a cool utility, but the fact that it can't handle big files is a showstopper.

Arabica doesn't handle escaped Chinese characters correctly

Hi Jez,

It seems that the HTML parser doesn't handle escaped Chinese characters correctly. For example, when it finds &#36825; it casts the value 36825 to a char, which truncates it to 217.

Is there any workaround for this issue?

Thanks in advance and thank you very much for this parser ;-)
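To illustrate what the report describes: 36825 is 0x8FD9, and keeping only the low byte gives 0xD9, which is 217. A numeric character reference has to be treated as a full Unicode code point and encoded, for example as UTF-8, rather than narrowed to a single char. The sketch below shows that encoding step in isolation. `encode_utf8` is a name made up for this example and is not part of Arabica, and where such a step would belong in the parser is not something this report establishes.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8.
// Illustration only: it does not reject surrogates or values above 0x10FFFF.
inline std::string encode_utf8(char32_t cp)
{
  std::string out;
  if (cp <= 0x7F)
  {
    out += static_cast<char>(cp);                         // 1 byte
  }
  else if (cp <= 0x7FF)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));           // 2 bytes
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp <= 0xFFFF)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));          // 3 bytes
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));          // 4 bytes
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For the code point in the report, `encode_utf8(36825)` yields the three bytes E8 BF 99, whereas `static_cast<unsigned char>(36825)` yields the single value 217.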
