
twitter-tools's Introduction

Twitter Tools

This repo holds a collection of tools for the TREC Microblog tracks, which officially ended in 2015. The track mailing list can be found at [email protected].

Archival Documents

API Access

The Microblog tracks in 2013 and 2014 used the "evaluation as a service" (EaaS) model, where teams interact with the official corpus via a common API. Although the evaluation has ended, the API is still available for researcher use.

To request access to the API, follow these steps:

  1. Fill out the API usage agreement.
  2. Email the usage agreement to [email protected].
  3. After NIST receives your usage agreement, you will be issued an access token.
  4. The code for accessing the API can be found in this repository. The endpoint of the API itself (i.e., hostname and port) will be provided by NIST.

Getting Started

The main Maven artifact for the TREC Microblog API is twitter-tools-core. The latest releases of Maven artifacts are available at Maven Central.

You can clone the repo with the following command:

$ git clone git://github.com/lintool/twitter-tools.git

Once you've cloned the repository, change directory into twitter-tools-core and build the package with Maven:

$ cd twitter-tools-core
$ mvn clean package appassembler:assemble

For more information, see the project wiki.

Replicating TREC Baselines

One advantage of the TREC Microblog API is that it makes it possible to deploy a community baseline whose results are replicable by anyone. The raw results are simply the output of the API, unmodified. The baseline results are the raw results post-processed to remove retweets and to break score ties in reverse chronological order (most recent first).
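For illustration, the post-processing step can be sketched as follows. This is a hypothetical reimplementation, not the official RunQueriesBaselineThrift code; it assumes tweet ids increase chronologically (true for Snowflake-era ids), so sorting ties by descending id puts the most recent tweet first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakDemo {
    // Minimal stand-in for one retrieved result.
    static final class Result {
        final long id;        // tweet id (assumed chronologically increasing)
        final double score;   // retrieval score from the API
        final boolean retweet;
        Result(long id, double score, boolean retweet) {
            this.id = id; this.score = score; this.retweet = retweet;
        }
    }

    // Drop retweets, then sort by score descending, breaking ties by
    // descending tweet id (i.e., most recent first).
    static List<Result> postProcess(List<Result> raw) {
        List<Result> out = new ArrayList<>();
        for (Result r : raw) {
            if (!r.retweet) out.add(r);
        }
        out.sort(Comparator.comparingDouble((Result r) -> r.score).reversed()
            .thenComparing(Comparator.comparingLong((Result r) -> r.id).reversed()));
        return out;
    }
}
```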

To run the raw results for TREC 2011, issue the following command:

sh target/appassembler/bin/RunQueriesThrift \
 -host [host] -port [port] -group [group] -token [token] \
 -queries ../data/topics.microblog2011.txt > run.microblog2011.raw.txt

And to run the baseline results for TREC 2011, issue the following command:

sh target/appassembler/bin/RunQueriesBaselineThrift \
 -host [host] -port [port] -group [group] -token [token] \
 -queries ../data/topics.microblog2011.txt > run.microblog2011.baseline.txt

Note that trec_eval is included in twitter-tools/etc (it just needs to be compiled), and the qrels are stored in twitter-tools/data (they just need to be uncompressed), so you can evaluate as follows:

../etc/trec_eval.9.0/trec_eval ../data/qrels.microblog2011.txt run.microblog2011.raw.txt

Similar commands will allow you to replicate runs for TREC 2012 and TREC 2013. With trec_eval, you should get exactly the following results:

MAP          raw     baseline
TREC 2011    0.3050  0.3576
TREC 2012    0.1751  0.2091
TREC 2013    0.2044  0.2532
TREC 2014    0.3090  0.3924

P30          raw     baseline
TREC 2011    0.3483  0.4000
TREC 2012    0.2831  0.3311
TREC 2013    0.3761  0.4450
TREC 2014    0.5145  0.6182

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

This work is supported in part by the National Science Foundation under award IIS-1218043. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the National Science Foundation.

twitter-tools's People

Contributors

aroegies, gtsherman, isoboroff, jimmy0017, jinfengr, lintool, milesefron, myleott


twitter-tools's Issues

Java set to require too much memory in etc/run.sh

After trying to run run.sh, the following error occurs:
$ etc/run.sh cc.twittertools.search.retrieval.RunQueryThrift -host ec2-107-22
-82-52.compute-1.amazonaws.com -port 9090 -queries data/queries.microblog2011.x
ml
Invalid maximum heap size: -Xmx4g
The specified size exceeds the maximum representable size.
Could not create the Java virtual machine.

This occurred on my Cygwin on my Windows 7 64bit machine.

This can be fixed by simply reducing the Java heap size from "java -Xmx4g" to "java -Xmx1g" (or similar) in etc/run.sh.

Windows binaries ?

Hi,
How can this code be compiled in windows?
Wouldn't it be easier if binaries for both Windows and Linux were made available? I've wasted so much time trying to get this working.
Thanks,
Victor

MalformedJsonException forced end to indexing

I was running the indexer at HEAD in trec2013-api over the weekend on my version of the 2013 crawl, and hit the odd exception below.

13/05/11 04:24:47 INFO indexing.IndexStatuses: 173700000 statuses indexed
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.Streams.parse(Streams.java:51)
at com.google.gson.JsonParser.parse(JsonParser.java:83)
at com.google.gson.JsonParser.parse(JsonParser.java:58)
at com.google.gson.JsonParser.parse(JsonParser.java:44)
at cc.twittertools.corpus.data.Status.fromJson(Status.java:112)
at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)
Caused by: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
at com.google.gson.stream.JsonReader.decodeLiteral(JsonReader.java:1100)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:343)
at com.google.gson.Streams.parse(Streams.java:38)
... 7 more
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.Streams.parse(Streams.java:51)
at com.google.gson.JsonParser.parse(JsonParser.java:83)
at com.google.gson.JsonParser.parse(JsonParser.java:58)
at com.google.gson.JsonParser.parse(JsonParser.java:44)
at cc.twittertools.corpus.data.Status.fromJson(Status.java:112)
at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)
Caused by: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
at com.google.gson.stream.JsonReader.decodeLiteral(JsonReader.java:1100)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:343)

I haven't tracked this further to find the bundle it barfed on. I think this might be happening when we have a malformed tweet at the end of a block, but I don't see why that should happen. It left me an index after the crash, so I'll see if I can make a test case.

com.twitter.corpus.demo classes don't compile

Looks like stale Lucene references. This will probably get fixed when we port to Lucene 4.

compile:
[javac] /home/soboroff/twitter-tools/build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 13 source files to /home/soboroff/twitter-tools/build
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/AndBaseline2011.java:100: package IndexStatuses.StatusField does not exist
[javac] query.add(new BooleanClause(new TermQuery(new Term(IndexStatuses.StatusField.TEXT.name, qword)), BooleanClause.Occur.MUST));
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/AndBaseline2011.java:105: package IndexStatuses.StatusField does not exist
[javac] new TermRangeFilter(IndexStatuses.StatusField.ID.name,
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/AndBaseline2011.java:115: package IndexStatuses.StatusField does not exist
[javac] hit.getField(IndexStatuses.StatusField.ID.name).stringValue(),
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/Ids2Dates.java:71: package IndexStatuses.StatusField does not exist
[javac] QueryParser qparser = new QueryParser(Version.LUCENE_31, IndexStatuses.StatusField.ID.name,
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/Ids2Dates.java:72: cannot find symbol
[javac] symbol : variable IndexStatuses
[javac] location: class com.twitter.corpus.demo.Ids2Dates
[javac] IndexStatuses.ANALYZER);
[javac] ^
[javac] /home/soboroff/twitter-tools/src/main/java/com/twitter/corpus/demo/Ids2Dates.java:85: package IndexStatuses.StatusField does not exist
[javac] out.println(line + " " + hit.getField(IndexStatuses.StatusField.CREATED_AT.name).stringValue());
[javac] ^
[javac] 6 errors

twitter collection 2011

I want to get the Tweets2011 collection dataset. When I run the AsyncHTMLStatusBlockCrawler class, the output file is always empty. Should I configure the accessToken and accessTokenSecret, like in twitter4j?

Store created_at as long

Currently, in the Lucene index, created_at is stored as a string. Changing this to an epoch ms (or something like that) as a long would save some index space.
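A sketch of the conversion (Lucene specifics omitted; in the index, the resulting value would then be stored with whatever numeric/long field type the Lucene version in use provides):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class CreatedAtDemo {
    // Twitter's created_at format, e.g. "Sun Jan 23 00:00:01 +0000 2011".
    private static final SimpleDateFormat FORMAT =
        new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);
    static { FORMAT.setTimeZone(TimeZone.getTimeZone("UTC")); }

    // Parse the created_at string into epoch milliseconds, suitable for
    // storing as a long (and for numeric range queries over time).
    // Synchronized because SimpleDateFormat is not thread-safe.
    static synchronized long toEpochMs(String createdAt) throws ParseException {
        return FORMAT.parse(createdAt).getTime();
    }
}
```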

Problem reading the corpus.

Hi,

I have crawled part of the data using your tool.
I had a problem with writeUTF(), which cannot handle strings over a certain length.
I changed the source code, replacing writeUTF() with writeBytes().
I now have a problem with readUTF() when I attempt to read the corpus.

I get the following error:

[14:49]tlargill@tamarin $ java -cp 'lib/*:dist/twitter-corpus-tools-0.0.1.jar' com.twitter.corpus.demo.ReadStatuses -input ../html/20110123-000.html.seq -dump -html
12/10/19 14:49:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
12/10/19 14:49:49 INFO compress.CodecPool: Got brand-new decompressor
28965131362770944 LovelyThang80 200 null null
28965131668946944 KoksalUgur 2020876334 null null
28965131803168769 renyfebry 538976316 null null
Exception in thread "main" java.io.UTFDataFormatException: malformed input around byte 2224
at java.io.DataInputStream.readUTF(DataInputStream.java:634)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at com.twitter.corpus.data.HtmlStatus.readFields(HtmlStatus.java:50)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1769)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1886)
at com.twitter.corpus.data.HtmlStatusBlockReader.next(HtmlStatusBlockReader.java:32)
at com.twitter.corpus.demo.ReadStatuses.main(ReadStatuses.java:99)

I also have a question regarding the fields of the dataset.
Is it normal that the value of the last two fields of the tweets I am able to read is always null?

Thanks

Spam filtering

Add Gord Cormack's logistic regression classifier. Write a simple CLI app to show random tweets and collect spam judgments to build a model. Then use the model to either cut tweets at index time, or filter at search time.
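As a sketch of the idea (not Cormack's actual code), an online logistic-regression filter over hashed byte 4-grams could look like the following; the feature dimensionality, hash function, and learning rate are all illustrative, not a trained model:

```java
public class SpamScoreDemo {
    // Illustrative feature space size for hashed byte 4-grams.
    static final int DIM = 1 << 20;

    // Logistic-regression spam probability for a tweet's raw bytes.
    static double score(byte[] text, float[] w) {
        double z = 0;
        for (int i = 0; i + 4 <= text.length; i++) {
            int h = 0;
            for (int j = 0; j < 4; j++) h = h * 31 + text[i + j];
            z += w[(h & 0x7fffffff) % DIM];
        }
        return 1.0 / (1.0 + Math.exp(-z)); // probability of spam
    }

    // One online gradient step on a labeled example (label: 1 = spam, 0 = ham).
    static void update(byte[] text, int label, float[] w, float rate) {
        double g = (label - score(text, w)) * rate;
        for (int i = 0; i + 4 <= text.length; i++) {
            int h = 0;
            for (int j = 0; j < 4; j++) h = h * 31 + text[i + j];
            w[(h & 0x7fffffff) % DIM] += (float) g;
        }
    }
}
```

The judgments collected by the CLI app would drive update(); the trained weights could then score tweets either at index time or at search time.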

Implement service to return term counts

We need a service to return term counts within a certain time interval. Need to decide:

  1. Actual implementation (separate service? squeeze into current service?)
  2. Granularity?
  3. Just unigrams? Arbitrary n-grams as well?
  4. Impact on efficiency?

Add Thrift API

Slap a Thrift API in front of Lucene to provide the beginnings of the official search API used for TREC 2013.

Fetch data from Tweets2011 Collection

Hello,

I'm trying to fetch the Tweets2011 according to the wiki page: https://github.com/lintool/twitter-tools/wiki/Tweets2011-Collection#fetching-a-status-block

I have a question regarding the example command:

sh target/appassembler/bin/AsyncHTMLStatusBlockCrawler \
   -data 20110123/20110123-000.dat -output json/20110123-000.json.

I got "20110123/20110123-000.dat does not exist!" while running this command.

Where can I find those .dat files? I could not find any page that describes those files.

Decide on (Lucene) Analyzer

We need to decide as a community, how to handle tokenization, stemming, etc.

Suggest we proceed in the following:

  1. Look at what the current implementation is doing: give examples of how it processes Twitter-specific elements such as hashtags, @-mentions, shortened URLs, etc.
  2. Send examples to track mailing list for people to comment on.
  3. Solicit feedback from the community.

stewhdcs is taking the lead on this.

Memory usage in IndexStatuses

IndexStatuses can OOM in the last stage, when it calls writer.forceMerge(1). An OOM in this case destroys the index; perhaps this is due to the actions in the finally{} clause?

This should be more robust. stewdhcs suggested a custom merge policy in issue #17.
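One way to make the final stage more robust, sketched here under the assumption that the pre-merge index is already valid (this is not the project's actual fix, and the Writer interface below is a stand-in for Lucene's IndexWriter): commit before attempting the optional forceMerge, so a merge failure leaves behind an unmerged but usable index instead of destroying it.

```java
public class SafeMergeDemo {
    // Stand-in for Lucene's IndexWriter: the merge is optional, the commit is not.
    interface Writer {
        void commit() throws Exception;
        void forceMerge(int maxSegments) throws Exception; // may OOM on huge indexes
        void close() throws Exception;
    }

    // Returns true if the merge succeeded; on failure the committed,
    // unmerged index is preserved.
    static boolean finishIndex(Writer w) throws Exception {
        w.commit(); // the unmerged index is now durable
        boolean merged = false;
        try {
            w.forceMerge(1);
            w.commit();
            merged = true;
        } catch (Throwable t) {
            // swallow (e.g. OutOfMemoryError): keep the committed, unmerged index
        } finally {
            try { w.close(); } catch (Exception ignored) { }
        }
        return merged;
    }
}
```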

Extract Named Entities

Extract Named Entities and add them to the index.

I've found that the Stanford NER works reasonably well on tweets using the caseless models, so perhaps this is a reasonable solution.

HTML crawler broken

The JSON-from-HTML crawler is currently broken since Twitter stopped embedding the JSON in the HTML. The screen-scraping HTML crawler is currently broken due to page changes.

"Connection timed out" on client.search

Hi there,
Since yesterday I’ve been facing an error when using the API:
java.net.ConnectException: Connection timed out: connect
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at cc.twittertools.search.api.TrecSearchThriftClient.search(TrecSearchThriftClient.java:36)
at search.BaseSearch.main(BaseSearch.java:58)
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.connect0(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:69)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:157)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
... 2 more
It was working fine before. Any suggestions?
Note: I haven’t switched to the newer version yet (could that be the problem?) and I am using the new hosts.

Thanks in advance,
Latifa

IndexStatuses OOM for very large collections (i.e. Tweets2013)

2013-04-17 07:16:46,041 [main] INFO IndexStatuses - 276300000 statuses indexed
2013-04-17 07:17:10,442 [main] INFO IndexStatuses - 276400000 statuses indexed
2013-04-17 07:17:30,239 [main] INFO IndexStatuses - Total of 276485008 statuses added
2013-04-17 07:17:30,239 [main] INFO IndexStatuses - Merging segments...
java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot flush
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2908)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2901)
at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1645)
at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1621)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:145)

After this error, the destination directory is empty, so we have to start from scratch.

Solution 1: bump up JVM settings in etc/run.sh
Solution 2: avoid OOM better?

RM3 doesn't implement duplicate removal

Since the current RM3 implementation doesn't perform duplicate removal, trec_eval doesn't work on its output (it needs to be hand-hacked to delete duplicates).

Response format

We need to select the format in which the results will be displayed, for instance, TREC, JSON, XML, etc.
A new option -format [trec|text|csv|tsv|json|xml] would then be required.
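For the TREC option, each result would be one line in the standard run format (topic, the literal "Q0", docid, rank, score, run tag). A minimal sketch, assuming numeric topic ids; the method name is illustrative:

```java
import java.util.Locale;

public class TrecFormatDemo {
    // Emit one result line in standard TREC run format:
    //   topic Q0 docid rank score runtag
    static String trecLine(int topic, long tweetId, int rank, double score,
                           String runTag) {
        return String.format(Locale.US, "%d Q0 %d %d %f %s",
                             topic, tweetId, rank, score, runTag);
    }
}
```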

Status.fromJson can fail, throwing an NPE

I found some statuses in my crawl that don't have the retweet_count field. This causes an NPE:
java.lang.NullPointerException
at cc.twittertools.corpus.data.Status.fromJson(Status.java:162)
at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)

Sure enough, Status.fromJson tries to set retweetCount by blindly dereferencing obj.get()'s result without checking it. Since it's an NPE, I can only get the stack trace by running in jdb.

I'm happy to submit a patch to check this case, but I want to know -- is there any reason to expect any of these fields to be in a given status? If not, is there a cleaner way to wrap these checks than using all these try-catch blocks?
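The check itself is simple however it is wrapped; a library-neutral sketch of the pattern (using a Map in place of Gson's JsonObject, whose get() likewise returns null for missing members -- the helper name is hypothetical):

```java
import java.util.Map;

public class SafeFieldDemo {
    // Read an optional numeric field, falling back to a default when the
    // field is absent -- the guard Status.fromJson needs around
    // obj.get("retweet_count").
    static long optLong(Map<String, Object> obj, String field, long dflt) {
        Object v = obj.get(field);
        return (v == null) ? dflt : ((Number) v).longValue();
    }
}
```

Centralizing the null check in one helper avoids scattering try-catch blocks through fromJson.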

trec_eval problem at compile time

Hi,
I am on a Windows machine and I'm using gcc with Cygwin64 to compile trec_eval. Nevertheless, I get an error about VERSIONID:

trec_eval.c:7:26: error: ‘VERSIONID’ undeclared here (not in a function)
static char *VersionID = VERSIONID;

So I tried commenting out the VERSIONID and VersionID references in the trec_eval.c file, but I still get an error (the message is shown below).
Could you kindly help me figure out the problem?

Thanks in advance,
Giuseppe

Error log:

/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x5e1): undefined reference to "te_get_zscores"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x5e1): relocation truncated to fit: R_X86_64_PC32 against undefined symbol "te_get_zscores"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x9d0): undefined reference to "te_convert_to_zscore"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x9d0): relocation truncated to fit: R_X86_64_PC32 against undefined symbol "te_convert_to_zscore"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x1324): undefined reference to "te_get_zscores_cleanup"
/tmp/ccQKBOkL.o:trec_eval.c:(.text+0x1324): relocation truncated to fit: R_X86_64_PC32 against undefined symbol "te_get_zscores_cleanup"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_form_inter_procs[.refptr.te_num_form_inter_procs]+0x0): undefined reference to "te_num_form_inter_procs"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_form_inter_procs[.refptr.te_form_inter_procs]+0x0): undefined reference to "te_form_inter_procs"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_trec_measure_nicknames[.refptr.te_num_trec_measure_nicknames]+0x0): undefined reference to "te_num_trec_measure_nicknames"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_trec_measure_nicknames[.refptr.te_trec_measure_nicknames]+0x0): undefined reference to "te_trec_measure_nicknames"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_trec_measures[.refptr.te_num_trec_measures]+0x0): undefined reference to "te_num_trec_measures"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_trec_measures[.refptr.te_trec_measures]+0x0): undefined reference to "te_trec_measures"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_results_format[.refptr.te_num_results_format]+0x0): undefined reference to "te_num_results_format"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_results_format[.refptr.te_results_format]+0x0): undefined reference to "te_results_format"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_num_rel_info_format[.refptr.te_num_rel_info_format]+0x0): undefined reference to "te_num_rel_info_format"
/tmp/ccQKBOkL.o:trec_eval.c:(.rdata$.refptr.te_rel_info_format[.refptr.te_rel_info_format]+0x0): undefined reference to "te_rel_info_format"
collect2: error: ld returned 1 exit status

Add auth mechanism

We need to add some sort of authentication mechanism so that only registered participants can access the API.

The test URL is access denied

I want to collect the twitter collection 2011. When I build the project with maven, the test URL in FetchStatusTest.java is access denied. However, I can visit this URL in my browser. What can I do with this problem?

Fetching status blocks for Tweets2011 - hits twitter api limit

I have been trying to get this tool to download the blocks from the Tweets2011 collection, but unfortunately the current implementation hits the Twitter API limit each time.

The Twitter limit on the read API is 180 hits per hour (see http://twitter4j.org/en/api-support.html), and 150 for unauthenticated requests.

I have tried

  • to create authenticated requests against the Twitter 1.1 API (since the older API is deprecated, and possibly removed from March 2013 onwards)
  • parsing the content out of the web pages directly (a brittle solution!); however, this doesn't work with protected accounts and missing pages

Given the number of requests generated by this solution, I am not sure how to build the Tweets2011 corpus.
