zhujiangang / wikixmlj
Automatically exported from code.google.com/p/wikixmlj
Titles containing ": " are not special pages.
Original issue reported on code.google.com by [email protected]
on 2 Nov 2012 at 4:01
What steps will reproduce the problem?
1.
2.
3.
What is the expected output? What do you see instead?
I am just calling the constructor and then parse.
I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/tools/bzip2/CBZip2InputStream
at edu.jhu.nlp.wikipedia.WikiXMLParserFactory.getSAXParser(WikiXMLParserFactory.java:15)
at tempProject.Main.main(Main.java:41)
Caused by: java.lang.ClassNotFoundException:
org.apache.tools.bzip2.CBZip2InputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
... 2 more
Java Result: 1
What version of the product are you using? On what operating system?
wikixmlj-r43.jar on Ubuntu 10.10
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 19 Jan 2011 at 6:49
make sure it handles most cases. Currently only a simple pattern match is
performed.
Original issue reported on code.google.com by [email protected]
on 6 Aug 2009 at 11:06
Felipe reports isDisambiguationPage() does not work for pages like
http://en.wikipedia.org/wiki/Lexington
Original issue reported on code.google.com by [email protected]
on 10 Aug 2009 at 2:30
This is due to the hardcoded "Category:" string. Wikipedia editions in other
languages use different surface forms. This can be easily resolved, but we'll
save it for later.
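One way to avoid the hardcoded string is to key the namespace prefix by the wiki's language code. The prefixes below are the actual Category namespace names in those Wikipedia editions, but the map-based design and the class/method names are only an illustrative sketch, not the project's code:

```java
import java.util.Map;

public class CategoryPrefix {
    // Surface form of the Category namespace varies per language edition.
    private static final Map<String, String> PREFIX = Map.of(
            "en", "Category:",
            "de", "Kategorie:",
            "fr", "Catégorie:",
            "es", "Categoría:");

    // Falls back to the English prefix for languages not in the map.
    public static boolean isCategoryLink(String linkTarget, String lang) {
        return linkTarget.startsWith(PREFIX.getOrDefault(lang, "Category:"));
    }
}
```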
Original issue reported on code.google.com by [email protected]
on 11 Oct 2008 at 1:10
What steps will reproduce the problem?
1. Consider any XML dump (with pages having an infobox) and run the XML Parser.
2. In the callback, print the infobox using getInfoBox() method.
3. The method always returns null.
4. Printing the output of getText() shows that the infobox is included in the
text.
What is the expected output? What do you see instead?
The expected output is the infobox for each page; instead, "null" is always returned.
What version of the product are you using? On what operating system?
I am using r43 on Ubuntu
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 20 Jan 2012 at 10:59
What steps will reproduce the problem?
1. Parse the XML file,
2. Find that the Unicode encoding is plain wrong
Not sure how this wasn't noticed as a serious error before, or what I'm doing
differently that triggers it while it apparently works for others. Strange.
For me, adding "UTF8" as the encoding of the InputStreamReader fixed
everything, so the Unicode characters are read in correctly.
protected InputSource getInputSource() throws Exception {
    BufferedReader br = null;
    if (wikiXMLFile.endsWith(".gz")) {
        br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(wikiXMLFile)), "UTF8"));
    } else if (wikiXMLFile.endsWith(".bz2")) {
        FileInputStream fis = new FileInputStream(wikiXMLFile);
        byte[] ignoreBytes = new byte[2];
        fis.read(ignoreBytes); // skip the "B", "Z" magic bytes written by command-line tools
        br = new BufferedReader(new InputStreamReader(
                new CBZip2InputStream(fis), "UTF8"));
    } else {
        br = new BufferedReader(new InputStreamReader(
                new FileInputStream(wikiXMLFile), "UTF8"));
    }
    return new InputSource(br);
}
Original issue reported on code.google.com by [email protected]
on 1 Apr 2010 at 4:37
What steps will reproduce the problem?
1. Language.urdu is not available (no Urdu support)
2.
3.
What is the expected output? What do you see instead?
I need to extract infoboxes from an XML file for the Urdu language.
What version of the product are you using? On what operating system?
latest version
Please provide any additional information below.
The Urdu XML file has the same structure as the English one; kindly help me
with this issue.
Original issue reported on code.google.com by [email protected]
on 29 Jul 2013 at 11:09
It should be possible to emulate an iterator by registering callbacks. For
this purpose, a default handler called IteratorHandler should be
registered. This will require a redesign of the WikiPageIterator class.
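One common way to bridge a push-style callback to a pull-style iterator is a bounded queue between the parser thread and the consumer. The sketch below is only an illustration of that design under assumed names (IteratorHandler, a String page payload); it is not the project's actual redesign:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The parser thread pushes each page into a bounded queue via the callback;
// the consumer pulls pages iterator-style, blocking until one is available.
public class IteratorHandler {
    // Distinct sentinel object marking end of parse (never equals a real page
    // by reference, even if a page happens to be the empty string).
    private static final String END = new String("end-of-parse");
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);

    // Called by the parser for each page (the callback side).
    public void process(String page) throws InterruptedException {
        queue.put(page);
    }

    // Called by the parser once parsing completes.
    public void finished() throws InterruptedException {
        queue.put(END);
    }

    // Called by the consumer; returns null when parsing is done (iterator side).
    public String nextPage() throws InterruptedException {
        String page = queue.take();
        return page == END ? null : page; // reference compare against the sentinel
    }
}
```

The bounded capacity also gives back-pressure: the parser blocks instead of building an unbounded Vector<WikiPage> in memory.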
Original issue reported on code.google.com by [email protected]
on 8 May 2009 at 6:13
What steps will reproduce the problem?
<text xml:space="preserve">#REDIRECT [[Albert Smith (footballer born
1900)]]</text>
The redirect page detection only covers the case "#REDIRECT ", but not
"#redirect"
For example:
<text xml:space="preserve">#redirect [[Vacuous truth]]</text>
<text xml:space="preserve">#redirect[[Hybrid vehicle]]</text>
<text xml:space="preserve">#Redirect [[Zhou Dynasty]]</text>
<text xml:space="preserve">#REDirect [[Glenn Medeiros]]</text>
One solution is to match the pattern case-insensitively.
What is the expected output? What do you see instead?
Should be able to handle the other case "#redirect"
What version of the product are you using? On what operating system?
r43, linux
Please provide any additional information below.
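The case-insensitive match suggested above could be sketched with java.util.regex; the pattern string and class name here are illustrative assumptions, not the library's actual code:

```java
import java.util.regex.Pattern;

public class RedirectCheck {
    // CASE_INSENSITIVE covers #REDIRECT, #redirect, #Redirect, #REDirect, etc.;
    // the trailing \s* also accepts the no-space form "#redirect[[...]]".
    private static final Pattern REDIRECT =
            Pattern.compile("^\\s*#redirect\\s*\\[\\[", Pattern.CASE_INSENSITIVE);

    public static boolean isRedirect(String wikiText) {
        return REDIRECT.matcher(wikiText).find();
    }

    public static void main(String[] args) {
        System.out.println(isRedirect("#REDirect [[Glenn Medeiros]]")); // true
        System.out.println(isRedirect("Plain article text"));           // false
    }
}
```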
Original issue reported on code.google.com by [email protected]
on 2 Nov 2012 at 3:58
What steps will reproduce the problem?
1. Index enwiki-20081008-pages-articles.xml.bz2
2. Count down every page parsed
3. At page 210200 or so the parser throws the following exception:
java.lang.ArrayIndexOutOfBoundsException: 0
at edu.jhu.nlp.wikipedia.WikiTextParser.parseLinks(WikiTextParser.java:71)
at edu.jhu.nlp.wikipedia.WikiTextParser.getLinks(WikiTextParser.java:50)
at edu.jhu.nlp.wikipedia.WikiPage.getLinks(WikiPage.java:104)
at Test$1.process(Test.java:78)
at edu.jhu.nlp.wikipedia.SAXPageCallbackHandler.endElement(SAXPageCallbackHandler.java:42)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:604)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1750)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2906)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:624)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:486)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:810)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:740)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:110)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1208)
at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:47)
at Test.main(Test.java:103)
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 5 Aug 2009 at 9:08
DOM Parser code runs out of memory for large dumps even in callback mode,
despite the "defer-node-expansion" option.
This is because the parse() method currently builds a Vector<WikiPage>
object. This can be avoided by a redesign of the WikiPageIterator class.
Original issue reported on code.google.com by [email protected]
on 8 May 2009 at 6:08
In class SAXPageCallbackHandler.characters()
The logic that is commented with the following comment ...
// TODO: To avoid looking at the revision ID, only the first ID is taken.
// I'm not sure how big the block size is in each call to characters(),
// so this may be unsafe.
... is indeed unsafe across buffer boundaries.
If the page id is split across two buffers then it is truncated.
So 450123 with the "450" starting at position 2045 of the ch[] and the "123"
starting at position 0 of ch[] of the next call will yield a result of "450".
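A buffer-safe fix is to append every characters() chunk to a StringBuilder and only read the value once the element closes. The handler below is a minimal sketch of that idea under assumed names (IdHandler, the "id" element handling); it is not the project's actual code:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class IdHandler extends DefaultHandler {
    private final StringBuilder currentId = new StringBuilder();
    private boolean inPageId = false;
    private boolean idSeen = false; // skip the later revision <id> elements

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("id".equals(qName) && !idSeen) {
            inPageId = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Appending each chunk means an id split across two buffers
        // (e.g. "450" in one call, "123" in the next) still yields "450123".
        if (inPageId) {
            currentId.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("id".equals(qName) && inPageId) {
            inPageId = false;
            idSeen = true; // the complete value is only final here
        }
    }

    public String getId() {
        return currentId.toString();
    }
}
```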
Original issue reported on code.google.com by [email protected]
on 18 Nov 2012 at 1:55
Since I have limited disk space, I want to parse the .bz2 file itself without
unzipping it.
I get the following error:
java -cp wikixmlj-r43.jar:.:bzip2.jar:.:xercesImpl-2.9.1.jar Test
enwiki-20130503-pages-articles-multistream.xml.bz2
[Fatal Error] :38:1: XML document structures must start and end within the same
entity.
org.xml.sax.SAXParseException: XML document structures must start and end
within the same entity.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:58)
at Test.main(Test.java:25)
Does wikixmlj parse .bz2 files, or does it work only on uncompressed XML files?
Original issue reported on code.google.com by [email protected]
on 10 Jun 2013 at 7:02
Doing concat on the strings in SAXPageCallbackHandler is amazingly slow.
Changing it to simply use a StringBuffer:
private StringBuffer currentWikitext;
private StringBuffer currentTitle;
and changing concat to append gives maybe a 40x increase in speed.
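The speedup comes from String.concat copying the whole accumulated text on every chunk (quadratic over a page), while a buffer append is amortized constant time. A minimal sketch of the suggested change, with the field name mirroring the report but the surrounding class being illustrative only:

```java
public class TextAccumulator {
    // StringBuffer as suggested in the report; StringBuilder would also work
    // (and avoids synchronization) since SAX delivers chunks on one thread.
    private final StringBuffer currentWikitext = new StringBuffer();

    // Called once per SAX characters() chunk: O(1) amortized per append,
    // versus O(n) per String.concat over the text accumulated so far.
    public void characters(char[] ch, int start, int length) {
        currentWikitext.append(ch, start, length);
    }

    public String getText() {
        return currentWikitext.toString();
    }
}
```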
Original issue reported on code.google.com by [email protected]
on 2 Apr 2010 at 12:18
This is missing and should be added.
Original issue reported on code.google.com by [email protected]
on 6 Aug 2009 at 11:05
Hi,
I was looking at the source code from SAXPageCallbackHandler.java and noticed
that you do not reset the currentTag variable in the endElement() callback.
As a result, characters that appear outside any tag are appended to the
previous tag's content.
best regards,
--ivo
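The reset described above could look like the sketch below; the class and field names are illustrative, not the actual SAXPageCallbackHandler code:

```java
public class TagTracker {
    private String currentTag = "";
    private final StringBuilder titleText = new StringBuilder();

    public void startElement(String qName) {
        currentTag = qName;
    }

    public void characters(char[] ch, int start, int length) {
        if ("title".equals(currentTag)) {
            titleText.append(ch, start, length);
        }
    }

    public void endElement(String qName) {
        // Without this reset, whitespace and text arriving between elements
        // would still be attributed to the tag that just closed.
        currentTag = "";
    }

    public String getTitle() {
        return titleText.toString();
    }
}
```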
Original issue reported on code.google.com by ivo.anastacio
on 30 Nov 2011 at 11:44
Once in a while, WikiTextParser (line 132) throws an exception. I assume this
happens whenever the InfoBox isn't properly closed, so it's probably not a
problem in WikiTextParser itself, but it could be handled more gracefully,
e.g. by testing that (endPos + 1 < wikiText.length()).
Stacktrace:
java.lang.StringIndexOutOfBoundsException: String index out of range: 4877
at java.lang.String.substring(String.java:1934)
at edu.jhu.nlp.wikipedia.WikiTextParser.parseInfoBox(WikiTextParser.java:132)
at edu.jhu.nlp.wikipedia.WikiTextParser.getInfoBox(WikiTextParser.java:110)
at edu.jhu.nlp.wikipedia.WikiPage.getInfoBox(WikiPage.java:136)
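The bounds check suggested in the report could be sketched as below; the class, method name, and exact guard are assumptions for illustration, not the actual parseInfoBox code:

```java
public class InfoBoxGuard {
    // Returns the infobox substring, or null when the box is malformed
    // (e.g. not properly closed), instead of throwing
    // StringIndexOutOfBoundsException. endPos is the index of the last
    // character of the box as located by the parser.
    public static String safeSlice(String wikiText, int startPos, int endPos) {
        if (startPos < 0 || startPos > endPos || endPos + 1 > wikiText.length()) {
            return null; // fail gracefully on an unclosed or truncated infobox
        }
        return wikiText.substring(startPos, endPos + 1);
    }
}
```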
Original issue reported on code.google.com by [email protected]
on 30 Jun 2010 at 9:08