
cnn-dailymail's Introduction

This fork changes the preprocessed output to JSON format so that non-TensorFlow libraries can work with the CNN/DailyMail summarization dataset.

Note: requires Python 3

This fork was developed primarily to work with this repository, which uses PyTorch.

--

1. Download data

Download and unzip the stories directories from here for both CNN and Daily Mail.

Warning: These files contain a few examples (114, in a dataset of over 300,000) for which the article text is missing; see, for example, cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story. The PyTorch code handles these examples without trouble, except in the extreme case where every example sampled into a batch is empty.
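If you want to list the affected files before training, the sketch below may help. It assumes that each .story file holds the article text followed by summary sentences, each introduced by an `@highlight` line; that layout is not described in this README, so verify it against your download. The function names are hypothetical and are not part of make_datafiles.py.

```python
from pathlib import Path

# Assumption: each summary sentence in a .story file is preceded by this marker.
HIGHLIGHT_MARKER = "@highlight"


def article_is_empty(story_text):
    """Return True if only whitespace precedes the first highlight marker."""
    article, _, _ = story_text.partition(HIGHLIGHT_MARKER)
    return not article.strip()


def find_empty_stories(stories_dir):
    """Return the names of .story files in stories_dir with no article text."""
    empty = []
    for path in sorted(Path(stories_dir).glob("*.story")):
        if article_is_empty(path.read_text(encoding="utf-8")):
            empty.append(path.name)
    return empty
```

Calling `find_empty_stories("/path/to/cnn/stories")` returns a list of file names that you could skip or filter out when building batches.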

2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following line to your ~/.bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see something like:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
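If the command above fails with a class-not-found error, the CLASSPATH is the most likely cause. As a rough check from Python, the sketch below lists the CLASSPATH entries whose file name starts with `stanford-corenlp` and ends with `.jar`. The helper name is hypothetical, and it only inspects the variable; it does not verify that the jar exists or that Java can load it.

```python
import os


def corenlp_jars_on_classpath(classpath=None):
    """Return CLASSPATH entries that look like a Stanford CoreNLP jar."""
    if classpath is None:
        # Fall back to the environment of the current process.
        classpath = os.environ.get("CLASSPATH", "")
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return [
        entry
        for entry in entries
        if os.path.basename(entry).startswith("stanford-corenlp")
        and entry.endswith(".jar")
    ]
```

An empty list from `corenlp_jars_on_classpath()` suggests that the export line has not taken effect in the current shell; opening a new terminal or sourcing the profile file usually fixes that.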

3. Process into JSON files (packed into tarballs) and a vocab_cnt file (Python pickle)

Run

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.

This script will do several things:

  • The directories cnn_stories_tokenized and dm_stories_tokenized will be created and filled with tokenized versions of cnn/stories and dailymail/stories. This may take some time. Note: you may see several Untokenizable: warnings from Stanford Tokenizer. These seem to be related to Unicode characters in the data; so far it seems OK to ignore them.
  • For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to tarball files train.tar, val.tar and test.tar. These will be placed in the newly-created finished_files directory. This may take some time.
  • Additionally, a vocab_cnt.pkl file is created from the training data and is also placed in finished_files. It is a pickled Python Counter of all words, which can be useful for choosing the vocabulary by word frequency.
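The outputs described above can be read back with the standard library alone. The following is a minimal sketch, not code from this fork: it assumes each tarball member ending in `.json` holds one example, makes no assumption about the keys inside each JSON object, and uses hypothetical function names, special tokens and vocabulary size. Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code.

```python
import json
import pickle
import tarfile


def iter_examples(tar_path):
    """Yield one decoded JSON object per .json member of the tarball."""
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".json"):
                with tar.extractfile(member) as handle:
                    yield json.loads(handle.read().decode("utf-8"))


def load_vocab_counter(pkl_path):
    """Load the pickled word Counter (for example finished_files/vocab_cnt.pkl)."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)


def build_vocab(counter, max_size, specials=("<pad>", "<unk>")):
    """Map special tokens, then the max_size most frequent words, to integer ids.

    The special tokens are illustrative; use whatever your model expects.
    """
    word2id = {token: index for index, token in enumerate(specials)}
    for word, _count in counter.most_common(max_size):
        word2id.setdefault(word, len(word2id))
    return word2id
```

For instance, `build_vocab(load_vocab_counter("finished_files/vocab_cnt.pkl"), 30000)` would keep the 30,000 most frequent training words, and `next(iter_examples("finished_files/val.tar"))` lets you inspect the fields of a single example before writing a data loader.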
