
twitter-tools's Introduction

Twitter Tools

This repo holds a collection of tools for the TREC Microblog tracks, which officially ended in 2015. The track mailing list can be found at [email protected].

Archival Documents

API Access

The Microblog tracks in 2013 and 2014 used the "evaluation as a service" (EaaS) model, where teams interact with the official corpus via a common API. Although the evaluation has ended, the API is still available for researcher use.

To request access to the API, follow these steps:

  1. Fill out the API usage agreement.
  2. Email the usage agreement to [email protected].
  3. After NIST receives your usage agreement, you will be issued an access token.
  4. The code for accessing the API can be found in this repository. The endpoint of the API itself (i.e., hostname and port) will be provided by NIST.

Getting Started

The main Maven artifact for the TREC Microblog API is twitter-tools-core. The latest releases of Maven artifacts are available at Maven Central.

You can clone the repo with the following command:

$ git clone git://github.com/lintool/twitter-tools.git

Once you've cloned the repository, change directory into twitter-tools-core and build the package with Maven:

$ cd twitter-tools-core
$ mvn clean package appassembler:assemble

For more information, see the project wiki.

Replicating TREC Baselines

One advantage of the TREC Microblog API is that it makes it possible to deploy a community baseline whose results are replicable by anyone. The raw results are simply the output of the API, unmodified. The baseline results are the raw results post-processed to remove retweets and to break score ties in reverse chronological order (most recent first).
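For illustration, the post-processing step can be sketched as follows. This is a hypothetical reimplementation, not the official RunQueriesBaselineThrift code; it assumes tweet ids increase chronologically (true for Snowflake-era ids), so sorting ties by descending id puts the most recent tweet first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakDemo {
    // Minimal stand-in for one retrieved result.
    static final class Result {
        final long id;        // tweet id (assumed chronologically increasing)
        final double score;   // retrieval score from the API
        final boolean retweet;
        Result(long id, double score, boolean retweet) {
            this.id = id; this.score = score; this.retweet = retweet;
        }
    }

    // Drop retweets, then sort by score descending, breaking ties by
    // descending tweet id (i.e., most recent first).
    static List<Result> postProcess(List<Result> raw) {
        List<Result> out = new ArrayList<>();
        for (Result r : raw) {
            if (!r.retweet) out.add(r);
        }
        out.sort(Comparator.comparingDouble((Result r) -> r.score).reversed()
            .thenComparing(Comparator.comparingLong((Result r) -> r.id).reversed()));
        return out;
    }
}
```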

To run the raw results for TREC 2011, issue the following command:

sh target/appassembler/bin/RunQueriesThrift \
 -host [host] -port [port] -group [group] -token [token] \
 -queries ../data/topics.microblog2011.txt > run.microblog2011.raw.txt

And to run the baseline results for TREC 2011, issue the following command:

sh target/appassembler/bin/RunQueriesBaselineThrift \
 -host [host] -port [port] -group [group] -token [token] \
 -queries ../data/topics.microblog2011.txt > run.microblog2011.baseline.txt

Note that trec_eval is included in twitter-tools/etc (it just needs to be compiled), and the qrels are stored in twitter-tools/data (they just need to be uncompressed), so you can evaluate as follows:

../etc/trec_eval.9.0/trec_eval ../data/qrels.microblog2011.txt run.microblog2011.raw.txt

Similar commands will allow you to replicate runs for TREC 2012 and TREC 2013. With trec_eval, you should get exactly the following results:

MAP          raw     baseline
TREC 2011    0.3050  0.3576
TREC 2012    0.1751  0.2091
TREC 2013    0.2044  0.2532
TREC 2014    0.3090  0.3924

P30          raw     baseline
TREC 2011    0.3483  0.4000
TREC 2012    0.2831  0.3311
TREC 2013    0.3761  0.4450
TREC 2014    0.5145  0.6182

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

This work is supported in part by the National Science Foundation under award IIS-1218043. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the National Science Foundation.

twitter-tools's People

Contributors

aroegies, gtsherman, isoboroff, jimmy0017, jinfengr, lintool, milesefron, myleott


twitter-tools's Issues

Java set to require too much memory in etc/run.sh

After trying to run run.sh, the following error occurs:
$ etc/run.sh cc.twittertools.search.retrieval.RunQueryThrift -host ec2-107-22
-82-52.compute-1.amazonaws.com -port 9090 -queries data/queries.microblog2011.x
ml
Invalid maximum heap size: -Xmx4g
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine.

This occurred on my Cygwin on my Windows 7 64bit machine.

This can be fixed by simply reducing the Java heap size from "java -Xmx4g" to "java -Xmx1g" (or similar) in etc/run.sh.

Windows binaries ?

Hi,
How can this code be compiled in windows?
Wouldn't it be easier if binaries for both Windows and Linux were made available? I've wasted so much time trying to get this working.
Thanks,
Victor

MalformedJsonException forced end to indexing

I was running the indexer at HEAD in trec2013-api over the weekend on my version of the 2013 crawl, and hit the odd exception below.

13/05/11 04:24:47 INFO indexing.IndexStatuses: 173700000 statuses indexed
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.Streams.parse(Streams.java:51)
at com.google.gson.JsonParser.parse(JsonParser.java:83)
at com.google.gson.JsonParser.parse(JsonParser.java:58)
at com.google.gson.JsonParser.parse(JsonParser.java:44)
at cc.twittertools.corpus.data.Status.fromJson(Status.java:112)
at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)
Caused by: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
at com.google.gson.stream.JsonReader.decodeLiteral(JsonReader.java:1100)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:343)
at com.google.gson.Streams.parse(Streams.java:38)
... 7 more
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.Streams.parse(Streams.java:51)
at com.google.gson.JsonParser.parse(JsonParser.java:83)
at com.google.gson.JsonParser.parse(JsonParser.java:58)
at com.google.gson.JsonParser.parse(JsonParser.java:44)
at cc.twittertools.corpus.data.Status.fromJson(Status.java:112)
at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)
Caused by: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
at com.google.gson.stream.JsonReader.decodeLiteral(JsonReader.java:1100)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:343)

I haven't tracked this further to find the bundle it barfed on. I think this might be happening when we have a malformed tweet at the end of a block, but I don't see why that should happen. It left me an index after the crash, so I'll see if I can make a test case.

com.twitter.corpus.demo classes don't compile

Looks like stale Lucene references. This will probably get fixed when we port to Lucene 4.

compile:
[javac] /home/soboroff/twitter-tools/build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 13 source files to /home/soboroff/twitter-tools/build
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/AndBaseline2011.java:100: package IndexStatuses.StatusField does not exist
[javac] query.add(new BooleanClause(new TermQuery(new Term(IndexStatuses.StatusField.TEXT.name, qword)), BooleanClause.Occur.MUST));
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/AndBaseline2011.java:105: package IndexStatuses.StatusField does not exist
[javac] new TermRangeFilter(IndexStatuses.StatusField.ID.name,
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/AndBaseline2011.java:115: package IndexStatuses.StatusField does not exist
[javac] hit.getField(IndexStatuses.StatusField.ID.name).stringValue(),
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/Ids2Dates.java:71: package IndexStatuses.StatusField does not exist
[javac] QueryParser qparser = new QueryParser(Version.LUCENE_31, IndexStatuses.StatusField.ID.name,
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/Ids2Dates.java:72: cannot find symbol
[javac] symbol : variable IndexStatuses
[javac] location: class com.twitter.corpus.demo.Ids2Dates
[javac] IndexStatuses.ANALYZER);
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/Ids2Dates.java:85: package IndexStatuses.StatusField does not exist
[javac] out.println(line + " " + hit.getField(IndexStatuses.StatusField.CREATED_AT.name).stringValue());
[javac] ^
[javac] 6 errors

twitter collection 2011

I want to get the Tweets2011 collection dataset. When I run the AsyncHTMLStatusBlockCrawler class, the output file is always empty. Should I configure the accessToken and accessTokenSecret, like in twitter4j?

Store created_at as long

Currently, in the Lucene index, created_at is stored as a string. Changing this to an epoch ms (or something like that) as a long would save some index space.
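A sketch of the conversion (Lucene specifics omitted; in the index, the resulting value would then be stored with whatever numeric/long field type the Lucene version in use provides):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class CreatedAtDemo {
    // Twitter's created_at format, e.g. "Sun Jan 23 00:00:01 +0000 2011".
    private static final SimpleDateFormat FORMAT =
        new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);
    static { FORMAT.setTimeZone(TimeZone.getTimeZone("UTC")); }

    // Parse the created_at string into epoch milliseconds, suitable for
    // storing as a long (and for numeric range queries over time).
    // Synchronized because SimpleDateFormat is not thread-safe.
    static synchronized long toEpochMs(String createdAt) throws ParseException {
        return FORMAT.parse(createdAt).getTime();
    }
}
```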

Problem reading the corpus.

Hi,

I have crawled part of the data using your tool.
I had a problem with writeUTF(), which cannot handle strings over a certain length.
I changed the source code, replacing writeUTF() with writeBytes().
I now have a problem with readUTF() when I attempt to read the corpus.

I get the following error:

[14:49]tlargill@tamarin $ java -cp 'lib/*:dist/twitter-corpus-tools-0.0.1.jar' com.twitter.corpus.demo.ReadStatuses -input ../html/20110123-000.html.seq -dump -html
12/10/19 14:49:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
28965131362770944 LovelyThang80 200 null null
28965131668946944 KoksalUgur 2020876334 null null
28965131803168769 renyfebry 538976316 null null
Exception in thread "main" java.io.UTFDataFormatException: malformed input around byte 2224
at java.io.DataInputStream.readUTF(DataInputStream.java:634)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at com.twitter.corpus.data.HtmlStatus.readFields(HtmlStatus.java:50)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1769)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1886)
at com.twitter.corpus.data.HtmlStatusBlockReader.next(HtmlStatusBlockReader.java:32)
at com.twitter.corpus.demo.ReadStatuses.main(ReadStatuses.java:99)

I also have a question regarding the fields of the dataset.
Is it normal that the value of the last two fields of the tweets I am able to read is always null?

Thanks

Spam filtering

Add Gord Cormack's logistic regression classifier. Write a simple CLI app to show random tweets and collect spam judgments to build a model. Then use the model to either cut tweets at index time, or filter at search time.
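As a sketch of the idea (not Cormack's actual code), an online logistic-regression filter over hashed byte 4-grams could look like the following; the feature dimensionality, hash function, and learning rate are all illustrative, not a trained model:

```java
public class SpamScoreDemo {
    // Illustrative feature space size for hashed byte 4-grams.
    static final int DIM = 1 << 20;

    // Logistic-regression spam probability for a tweet's raw bytes.
    static double score(byte[] text, float[] w) {
        double z = 0;
        for (int i = 0; i + 4 <= text.length; i++) {
            int h = 0;
            for (int j = 0; j < 4; j++) h = h * 31 + text[i + j];
            z += w[(h & 0x7fffffff) % DIM];
        }
        return 1.0 / (1.0 + Math.exp(-z)); // probability of spam
    }

    // One online gradient step on a labeled example (label: 1 = spam, 0 = ham).
    static void update(byte[] text, int label, float[] w, float rate) {
        double g = (label - score(text, w)) * rate;
        for (int i = 0; i + 4 <= text.length; i++) {
            int h = 0;
            for (int j = 0; j < 4; j++) h = h * 31 + text[i + j];
            w[(h & 0x7fffffff) % DIM] += (float) g;
        }
    }
}
```

The judgments collected by the CLI app would drive update(); the trained weights could then score tweets either at index time or at search time.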

Implement service to return term counts

We need a service to return term counts within a certain time interval. Need to decide:

  1. Actual implementation (separate service? squeeze into current service?)
  2. Granularity?
  3. Just unigrams? Arbitrary n-grams as well?
  4. Impact on efficiency?

Add Thrift API

Slap a Thrift API in front of Lucene to provide the beginnings of the official search API used for TREC 2013.

Fetch data from Tweets2011 Collection

Hello,

I'm trying to fetch the Tweets2011 according to the wiki page: https://github.com/lintool/twitter-tools/wiki/Tweets2011-Collection#fetching-a-status-block

I have a question regarding the example command:

sh target/appassembler/bin/AsyncHTMLStatusBlockCrawler \
   -data 20110123/20110123-000.dat -output json/20110123-000.json.

I got "20110123/20110123-000.dat does not exist!" while running this command.

Where can I find those .dat files? I could not find any page that describes those files.

Decide on (Lucene) Analyzer

We need to decide as a community, how to handle tokenization, stemming, etc.

Suggest we proceed in the following:

  1. Look at what the current implementation is doing: give examples of how it processes Twitter-specific elements such as hashtags, @-mentions, shortened URLs, etc.
  2. Send examples to track mailing list for people to comment on.
  3. Solicit feedback from the community.

stewhdcs is taking the lead on this.

Memory usage in IndexStatuses

IndexStatuses can OOM in the last stage, when it calls writer.forceMerge(1). An OOM in this case destroys the index; perhaps this is due to the actions in the finally{} clause?

This should be more robust. stewdhcs suggested a custom merge policy in issue #17.
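One way to make the final stage more robust, sketched here under the assumption that the pre-merge index is already valid (this is not the project's actual fix, and the Writer interface below is a stand-in for Lucene's IndexWriter): commit before attempting the optional forceMerge, so a merge failure leaves behind an unmerged but usable index instead of destroying it.

```java
public class SafeMergeDemo {
    // Stand-in for Lucene's IndexWriter: the merge is optional, the commit is not.
    interface Writer {
        void commit() throws Exception;
        void forceMerge(int maxSegments) throws Exception; // may OOM on huge indexes
        void close() throws Exception;
    }

    // Returns true if the merge succeeded; on failure the committed,
    // unmerged index is preserved.
    static boolean finishIndex(Writer w) throws Exception {
        w.commit(); // the unmerged index is now durable
        boolean merged = false;
        try {
            w.forceMerge(1);
            w.commit();
            merged = true;
        } catch (Throwable t) {
            // swallow (e.g. OutOfMemoryError): keep the committed, unmerged index
        } finally {
            try { w.close(); } catch (Exception ignored) { }
        }
        return merged;
    }
}
```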

Extract Named Entities

Extract Named Entities and add them to the index.

I've found that the Stanford NER works reasonably well on tweets using the caseless models, so perhaps this is a reasonable solution.

HTML crawler broken

The JSON-from-HTML crawler is currently broken since Twitter stopped embedding the JSON in the HTML. The screen-scraping HTML crawler is currently broken due to page changes.

"Connection timed out" on client.search

Hi there,
Since yesterday I’ve been facing an error when using the API:
java.net.ConnectException: Connection timed out: connect
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at cc.twittertools.search.api.TrecSearchThriftClient.search(TrecSearchThriftClient.java:36)
at search.BaseSearch.main(BaseSearch.java:58)
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.connect0(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:69)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:157)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
... 2 more
It was working fine before. Any suggestions?
Note: I haven’t switched to the newer version yet (could that be the problem?) and I am using the new hosts.

Thanks in advance,
Latifa

IndexStatuses OOM for very large collections (i.e. Tweets2013)

2013-04-17 07:16:46,041 [main] INFO IndexStatuses - 276300000 statuses indexed
2013-04-17 07:17:10,442 [main] INFO IndexStatuses - 276400000 statuses indexed
2013-04-17 07:17:30,239 [main] INFO IndexStatuses - Total of 276485008 statuses added
2013-04-17 07:17:30,239 [main] INFO IndexStatuses - Merging segments...
java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot flush
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2908)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2901)
at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1645)
at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1621)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:145)

After this error, the destination directory is empty, so we have to start from scratch.

Solution 1: bump up JVM settings in etc/run.sh
Solution 2: avoid OOM better?

RM3 doesn't implement duplicate removal

Since the current RM3 implementation doesn't perform duplicate removal, trec_eval doesn't work on its output (it needs to be hand-hacked to delete duplicates).

Response format

We need to select the format in which the results will be displayed, for instance, TREC, JSON, XML, etc.
A new option -format [trec|text|csv|tsv|json|xml] would then be required.
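For the TREC option, each result would be one line in the standard run format (topic, the literal "Q0", docid, rank, score, run tag). A minimal sketch, assuming numeric topic ids; the method name is illustrative:

```java
import java.util.Locale;

public class TrecFormatDemo {
    // Emit one result line in standard TREC run format:
    //   topic Q0 docid rank score runtag
    static String trecLine(int topic, long tweetId, int rank, double score,
                           String runTag) {
        return String.format(Locale.US, "%d Q0 %d %d %f %s",
                             topic, tweetId, rank, score, runTag);
    }
}
```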

Status.fromJson can fail, throwing an NPE

I found some statuses in my crawl that don't have the retweet_count field. This causes an NPE:
java.lang.NullPointerException
at cc.twittertools.corpus.data.Status.fromJson(Status.java:162)
at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)

Sure enough, Status.fromJson tries to set retweetCount by blindly dereferencing obj.get()'s result without checking it. Since it's an NPE, I can only get the stack trace by running in jdb.

I'm happy to submit a patch to check this case, but I want to know -- is there any reason to expect any of these fields to be in a given status? If not, is there a cleaner way to wrap these checks than using all these try-catch blocks?
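The check itself is simple however it is wrapped; a library-neutral sketch of the pattern (using a Map in place of Gson's JsonObject, whose get() likewise returns null for missing members -- the helper name is hypothetical):

```java
import java.util.Map;

public class SafeFieldDemo {
    // Read an optional numeric field, falling back to a default when the
    // field is absent -- the guard Status.fromJson needs around
    // obj.get("retweet_count").
    static long optLong(Map<String, Object> obj, String field, long dflt) {
        Object v = obj.get(field);
        return (v == null) ? dflt : ((Number) v).longValue();
    }
}
```

Centralizing the null check in one helper avoids scattering try-catch blocks through fromJson.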

trec_eval problem at compile time

Hi,
I am on a Windows machine and I'm using gcc with Cygwin64 to compile trec_eval. Nevertheless, I get an error about VERSIONID:

trec_eval.c:7:26: error: ‘VERSIONID’ undeclared here (not in a function)
static char *VersionID = VERSIONID;

So I tried commenting out the VERSIONID and VersionID references in the trec_eval.c file, but I still get an error (the message is shown below).
Could you kindly help me figure out the problem?

Thanks in advance,
Giuseppe

Error log:

/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x5e1): undefined reference to "te_get_zscores"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x5e1): relocation truncated to fit: R_X86_64_PC32 against undefined symbol "te_get_zscores"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x9d0): undefined reference to "te_convert_to_zscore"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x9d0): relocation truncated to fit: R_X86_64_PC32 against undefined symbol "te_convert_to_zscore"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x1324): undefined reference to "te_get_zscores_cleanup"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x1324): relocation truncated to fit: R_X86_64_PC32 against undefined symbol "te_get_zscores_cleanup"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_form_inter_procs[.refptr.te_num_form_inter_procs]+0x0): undefined reference to "te_num_form_inter_procs"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_form_inter_procs[.refptr.te_form_inter_procs]+0x0): undefined reference to "te_form_inter_procs"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_trec_measure_nicknames[.refptr.te_num_trec_measure_nicknames]+0x0): undefined reference to "te_num_trec_measure_nicknames"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_trec_measure_nicknames[.refptr.te_trec_measure_nicknames]+0x0): undefined reference to "te_trec_measure_nicknames"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_trec_measures[.refptr.te_num_trec_measures]+0x0): undefined reference to "te_num_trec_measures"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_trec_measures[.refptr.te_trec_measures]+0x0): undefined reference to "te_trec_measures"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_results_format[.refptr.te_num_results_format]+0x0): undefined reference to "te_num_results_format"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_results_format[.refptr.te_results_format]+0x0): undefined reference to "te_results_format"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_rel_info_format[.refptr.te_num_rel_info_format]+0x0): undefined reference to "te_num_rel_info_format"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_rel_info_format[.refptr.te_rel_info_format]+0x0): undefined reference to "te_rel_info_format"
collect2: error: ld returned 1 exit status

Add auth mechanism

We need to add some sort of authentication mechanism so that only registered participants can access the API.

The test URL is access denied

I want to collect the twitter collection 2011. When I build the project with maven, the test URL in FetchStatusTest.java is access denied. However, I can visit this URL in my browser. What can I do with this problem?

Fetching status blocks for Tweets2011 - hits twitter api limit

I have been trying to get this tool to download the blocks from the Tweets2011 collection, but unfortunately the current implementation hits the Twitter API limit each time.

The Twitter limit on the read API is 180 hits per hour (see http://twitter4j.org/en/api-support.html), and 150 for unauthenticated requests.

I have tried

  • to create authenticated requests against the Twitter 1.1 API (since the older API is deprecated, and possibly removed from March 2013 onwards)
  • parsing the content out of the web pages directly (a brittle solution!); however, this doesn't work with protected accounts and missing pages

Given the number of requests generated by this solution, I am not sure how to build the Tweets2011 corpus.
