zhujiangang / wikixmlj
Automatically exported from code.google.com/p/wikixmlj
Titles containing ": " are not special pages.
Original issue reported on code.google.com by [email protected]
on 2 Nov 2012 at 4:01
What steps will reproduce the problem?
1.
2.
3.
What is the expected output? What do you see instead?
I am just calling the constructor and then parse.
I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/tools/bzip2/CBZip2InputStream
at edu.jhu.nlp.wikipedia.WikiXMLParserFactory.getSAXParser(WikiXMLParserFactory.java:15)
at tempProject.Main.main(Main.java:41)
Caused by: java.lang.ClassNotFoundException:
org.apache.tools.bzip2.CBZip2InputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
... 2 more
Java Result: 1
What version of the product are you using? On what operating system?
wikixmlj-r43.jar on Ubuntu 10.10
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 19 Jan 2011 at 6:49
make sure it handles most cases. Currently only a simple pattern match is
performed.
Original issue reported on code.google.com by [email protected]
on 6 Aug 2009 at 11:06
Felipe reports isDisambiguationPage() does not work for pages like
http://en.wikipedia.org/wiki/Lexington
Original issue reported on code.google.com by [email protected]
on 10 Aug 2009 at 2:30
This is due to the hardcoded "Category:" string. Wikipedia editions in other
languages use different surface forms. This can be easily resolved, but we'll
save it for later.
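One way to avoid the hardcoded string is to key the namespace prefix by the wiki's language code. The prefixes below are the actual Category namespace names in those Wikipedia editions, but the map-based design and the class/method names are only an illustrative sketch, not the project's code:

```java
import java.util.Map;

public class CategoryPrefix {
    // Surface form of the Category namespace varies per language edition.
    private static final Map<String, String> PREFIX = Map.of(
            "en", "Category:",
            "de", "Kategorie:",
            "fr", "Catégorie:",
            "es", "Categoría:");

    // Falls back to the English prefix for languages not in the map.
    public static boolean isCategoryLink(String linkTarget, String lang) {
        return linkTarget.startsWith(PREFIX.getOrDefault(lang, "Category:"));
    }
}
```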
Original issue reported on code.google.com by [email protected]
on 11 Oct 2008 at 1:10
What steps will reproduce the problem?
1. Consider any XML dump (with pages having an infobox) and run the XML Parser.
2. In the callback, print the infobox using getInfoBox() method.
3. The method always returns null.
4. Printing the output of getText() shows that the infobox is included in the
text.
What is the expected output? What do you see instead?
The expected output is the infobox for each page; instead, "null" is always returned.
What version of the product are you using? On what operating system?
I am using r43 on Ubuntu
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 20 Jan 2012 at 10:59
What steps will reproduce the problem?
1. Parse the XML file,
2. Find that the Unicode encoding is plain wrong
Not sure how this wasn't noticed as a serious error before, or what I'm doing
differently that triggers it while it apparently works for others. Strange.
For me, adding "UTF8" as the encoding of the InputStreamReader fixed
everything, so the Unicode characters are read in correctly.
protected InputSource getInputSource() throws Exception {
    BufferedReader br = null;
    if (wikiXMLFile.endsWith(".gz")) {
        br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(wikiXMLFile)), "UTF8"));
    } else if (wikiXMLFile.endsWith(".bz2")) {
        FileInputStream fis = new FileInputStream(wikiXMLFile);
        byte[] ignoreBytes = new byte[2];
        fis.read(ignoreBytes); // skip the "B", "Z" magic bytes written by command-line tools
        br = new BufferedReader(new InputStreamReader(
                new CBZip2InputStream(fis), "UTF8"));
    } else {
        br = new BufferedReader(new InputStreamReader(
                new FileInputStream(wikiXMLFile), "UTF8"));
    }
    return new InputSource(br);
}
Original issue reported on code.google.com by [email protected]
on 1 Apr 2010 at 4:37
What steps will reproduce the problem?
1. Language.urdu is not available (no Urdu support)
2.
3.
What is the expected output? What do you see instead?
I need to extract infoboxes from an XML file for the Urdu language.
What version of the product are you using? On what operating system?
latest version
Please provide any additional information below.
The Urdu XML file has the same structure as the English one; kindly help me
with this issue.
Original issue reported on code.google.com by [email protected]
on 29 Jul 2013 at 11:09
It should be possible to emulate an iterator by registering callbacks. For
this purpose, a default handler called IteratorHandler should be
registered. This will require a redesign of the WikiPageIterator class.
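One common way to bridge a push-style callback to a pull-style iterator is a bounded queue between the parser thread and the consumer. The sketch below is only an illustration of that design under assumed names (IteratorHandler, a String page payload); it is not the project's actual redesign:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The parser thread pushes each page into a bounded queue via the callback;
// the consumer pulls pages iterator-style, blocking until one is available.
public class IteratorHandler {
    // Distinct sentinel object marking end of parse (never equals a real page
    // by reference, even if a page happens to be the empty string).
    private static final String END = new String("end-of-parse");
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);

    // Called by the parser for each page (the callback side).
    public void process(String page) throws InterruptedException {
        queue.put(page);
    }

    // Called by the parser once parsing completes.
    public void finished() throws InterruptedException {
        queue.put(END);
    }

    // Called by the consumer; returns null when parsing is done (iterator side).
    public String nextPage() throws InterruptedException {
        String page = queue.take();
        return page == END ? null : page; // reference compare against the sentinel
    }
}
```

The bounded capacity also gives back-pressure: the parser blocks instead of building an unbounded Vector<WikiPage> in memory.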
Original issue reported on code.google.com by [email protected]
on 8 May 2009 at 6:13
What steps will reproduce the problem?
<text xml:space="preserve">#REDIRECT [[Albert Smith (footballer born
1900)]]</text>
The redirect page detection only covers the case "#REDIRECT ", but not
"#redirect"
For example:
<text xml:space="preserve">#redirect [[Vacuous truth]]</text>
<text xml:space="preserve">#redirect[[Hybrid vehicle]]</text>
<text xml:space="preserve">#Redirect [[Zhou Dynasty]]</text>
<text xml:space="preserve">#REDirect [[Glenn Medeiros]]</text>
One solution is to match the pattern case-insensitively.
What is the expected output? What do you see instead?
Should be able to handle the other case "#redirect"
What version of the product are you using? On what operating system?
r43, linux
Please provide any additional information below.
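The case-insensitive match suggested above could be sketched with java.util.regex; the pattern string and class name here are illustrative assumptions, not the library's actual code:

```java
import java.util.regex.Pattern;

public class RedirectCheck {
    // CASE_INSENSITIVE covers #REDIRECT, #redirect, #Redirect, #REDirect, etc.;
    // the trailing \s* also accepts the no-space form "#redirect[[...]]".
    private static final Pattern REDIRECT =
            Pattern.compile("^\\s*#redirect\\s*\\[\\[", Pattern.CASE_INSENSITIVE);

    public static boolean isRedirect(String wikiText) {
        return REDIRECT.matcher(wikiText).find();
    }

    public static void main(String[] args) {
        System.out.println(isRedirect("#REDirect [[Glenn Medeiros]]")); // true
        System.out.println(isRedirect("Plain article text"));           // false
    }
}
```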
Original issue reported on code.google.com by [email protected]
on 2 Nov 2012 at 3:58
What steps will reproduce the problem?
1. Index enwiki-20081008-pages-articles.xml.bz2
2. Count down every page parsed
3. At page 210200 or so the parser throws the following exception:
java.lang.ArrayIndexOutOfBoundsException: 0
at edu.jhu.nlp.wikipedia.WikiTextParser.parseLinks(WikiTextParser.java:71)
at edu.jhu.nlp.wikipedia.WikiTextParser.getLinks(WikiTextParser.java:50)
at edu.jhu.nlp.wikipedia.WikiPage.getLinks(WikiPage.java:104)
at Test$1.process(Test.java:78)
at edu.jhu.nlp.wikipedia.SAXPageCallbackHandler.endElement(SAXPageCallbackHandler.java:42)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:604)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1750)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2906)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:624)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:486)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:810)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:740)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:110)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1208)
at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:47)
at Test.main(Test.java:103)
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 5 Aug 2009 at 9:08
DOM Parser code runs out of memory for large dumps even in callback mode,
despite the "defer-node-expansion" option.
This is because the parse() method currently builds a Vector<WikiPage>
object. This can be avoided by a redesign of the WikiPageIterator class.
Original issue reported on code.google.com by [email protected]
on 8 May 2009 at 6:08
In class SAXPageCallbackHandler.characters()
The logic that is commented with the following comment ...
// TODO: To avoid looking at the revision ID, only the first ID is taken.
// I'm not sure how big the block size is in each call to characters(),
// so this may be unsafe.
... is indeed unsafe across buffer boundaries.
If the page id is split across two buffers then it is truncated.
So 450123 with the "450" starting at position 2045 of the ch[] and the "123"
starting at position 0 of ch[] of the next call will yield a result of "450".
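A buffer-safe fix is to append every characters() chunk to a StringBuilder and only read the value once the element closes. The handler below is a minimal sketch of that idea under assumed names (IdHandler, the "id" element handling); it is not the project's actual code:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class IdHandler extends DefaultHandler {
    private final StringBuilder currentId = new StringBuilder();
    private boolean inPageId = false;
    private boolean idSeen = false; // skip the later revision <id> elements

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("id".equals(qName) && !idSeen) {
            inPageId = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Appending each chunk means an id split across two buffers
        // (e.g. "450" in one call, "123" in the next) still yields "450123".
        if (inPageId) {
            currentId.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("id".equals(qName) && inPageId) {
            inPageId = false;
            idSeen = true; // the complete value is only final here
        }
    }

    public String getId() {
        return currentId.toString();
    }
}
```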
Original issue reported on code.google.com by [email protected]
on 18 Nov 2012 at 1:55
Since I have limited disk space, I want to parse the .bz2 file itself without
unzipping it.
I get the following error:
java -cp wikixmlj-r43.jar:.:bzip2.jar:.:xercesImpl-2.9.1.jar Test
enwiki-20130503-pages-articles-multistream.xml.bz2
[Fatal Error] :38:1: XML document structures must start and end within the same
entity.
org.xml.sax.SAXParseException: XML document structures must start and end
within the same entity.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:58)
at Test.main(Test.java:25)
Does wikixmlj parse .bz2 files, or does it work only on uncompressed XML files?
Original issue reported on code.google.com by [email protected]
on 10 Jun 2013 at 7:02
Doing concat on the strings in SAXPageCallbackHandler is amazingly slow.
Changing it to simply use a StringBuffer:
private StringBuffer currentWikitext;
private StringBuffer currentTitle;
and changing concat to append gives maybe a 40x increase in speed.
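The speedup comes from String.concat copying the whole accumulated text on every chunk (quadratic over a page), while a buffer append is amortized constant time. A minimal sketch of the suggested change, with the field name mirroring the report but the surrounding class being illustrative only:

```java
public class TextAccumulator {
    // StringBuffer as suggested in the report; StringBuilder would also work
    // (and avoids synchronization) since SAX delivers chunks on one thread.
    private final StringBuffer currentWikitext = new StringBuffer();

    // Called once per SAX characters() chunk: O(1) amortized per append,
    // versus O(n) per String.concat over the text accumulated so far.
    public void characters(char[] ch, int start, int length) {
        currentWikitext.append(ch, start, length);
    }

    public String getText() {
        return currentWikitext.toString();
    }
}
```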
Original issue reported on code.google.com by [email protected]
on 2 Apr 2010 at 12:18
This is missing and should be added.
Original issue reported on code.google.com by [email protected]
on 6 Aug 2009 at 11:05
Hi,
I was looking at the source code from SAXPageCallbackHandler.java and noticed
that you do not reset the currentTag variable in the endElement() callback.
As a result, characters that appear outside any tag are appended to the
previous tag's content.
best regards,
--ivo
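The reset described above could look like the sketch below; the class and field names are illustrative, not the actual SAXPageCallbackHandler code:

```java
public class TagTracker {
    private String currentTag = "";
    private final StringBuilder titleText = new StringBuilder();

    public void startElement(String qName) {
        currentTag = qName;
    }

    public void characters(char[] ch, int start, int length) {
        if ("title".equals(currentTag)) {
            titleText.append(ch, start, length);
        }
    }

    public void endElement(String qName) {
        // Without this reset, whitespace and text arriving between elements
        // would still be attributed to the tag that just closed.
        currentTag = "";
    }

    public String getTitle() {
        return titleText.toString();
    }
}
```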
Original issue reported on code.google.com by ivo.anastacio
on 30 Nov 2011 at 11:44
Once in a while, WikiTextParser (line 132) throws an exception. I assume this
happens whenever the InfoBox isn't properly closed, so it's probably not a
problem in WikiTextParser itself, but it could be handled more gracefully,
e.g. by testing that (endPos + 1 < wikiText.length()).
Stacktrace:
java.lang.StringIndexOutOfBoundsException: String index out of range: 4877
at java.lang.String.substring(String.java:1934)
at edu.jhu.nlp.wikipedia.WikiTextParser.parseInfoBox(WikiTextParser.java:132)
at edu.jhu.nlp.wikipedia.WikiTextParser.getInfoBox(WikiTextParser.java:110)
at edu.jhu.nlp.wikipedia.WikiPage.getInfoBox(WikiPage.java:136)
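The bounds check suggested in the report could be sketched as below; the class, method name, and exact guard are assumptions for illustration, not the actual parseInfoBox code:

```java
public class InfoBoxGuard {
    // Returns the infobox substring, or null when the box is malformed
    // (e.g. not properly closed), instead of throwing
    // StringIndexOutOfBoundsException. endPos is the index of the last
    // character of the box as located by the parser.
    public static String safeSlice(String wikiText, int startPos, int endPos) {
        if (startPos < 0 || startPos > endPos || endPos + 1 > wikiText.length()) {
            return null; // fail gracefully on an unclosed or truncated infobox
        }
        return wikiText.substring(startPos, endPos + 1);
    }
}
```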
Original issue reported on code.google.com by [email protected]
on 30 Jun 2010 at 9:08