dstc8-reddit

Reddit corpus construction code for the DSTC 8 Competition, Multi-Domain End-to-End Track, Task 2: Fast Adaptation.

See the DSTC 8 website, track proposal, and challenge homepage for more details.

This package is based on Luigi and downloads raw data from the third-party Pushshift repository.

Generating the Corpus

Requirements

  • Python 3.7+ (the dependencies use f-strings, and the packaging step passes compresslevel to ZipFile, which requires 3.7; see Known Issues below)
  • ~210 GB space for constructing the dialogues with default settings
  • An internet connection
  • 24-72 hours to generate the data
    • Time depends on internet connection speed, number of cores, and available RAM
    • On a "beefy" machine with 16+ cores and 64 GB+ RAM this should take under two days

Setup and Generation

  1. Modify run_dir in configs/config.prod.yaml to point at the directory where you want all data to be generated (see the example below).
  2. Install the package with python setup.py install.
  3. Generate the data with python scripts/reddit.py generate.
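
For reference, the edit in step 1 might look like the following (the path here is illustrative; run_dir is the only key you need to change):

# configs/config.prod.yaml
run_dir: /data/dstc8-reddit    # all raw, intermediate, and final data is written here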

Corpus Information

  • 1000 relatively non-toxic subreddits with over 75,000 subscribers each
  • 12 months of data, November 2017 to October 2018 (inclusive)
  • Up to two dialogues sampled per post, from different top-level comments
  • Additional validation splits that vary the date range and subreddit set with respect to the training set
  • Dialogues have at least 4 turns each
  • Filtering is based on Reddit API fields, with bot-like content and other noise also removed
  • No post-processing is done on the corpus; our preprocessing code will be made public in our baseline model release
  • The final dataset zip is approximately 4.2 GB in size

Folder                                                                 Total Dialogues
dstc8-reddit-corpus.zip:dialogues/training                                   5,085,113
dstc8-reddit-corpus.zip:dialogues/validation_date_in_subreddit_in              254,624
dstc8-reddit-corpus.zip:dialogues/validation_date_in_subreddit_out           1,278,998
dstc8-reddit-corpus.zip:dialogues/validation_date_out_subreddit_in           1,037,977
dstc8-reddit-corpus.zip:dialogues/validation_date_out_subreddit_out            262,036

Schema

The zip file is structured like this:

dstc8-reddit-corpus.zip:
  - dialogues/
    - training/                           # From [2017-11, ..., 2018-08] and 920 training subreddits
      - <subreddit>.txt
      ...
    - validation_date_in_subreddit_in/    # From [2017-11, ..., 2018-08] and 920 training subreddits
      # Dialogues are disjoint from those in training
      - <subreddit>.txt
      ...
    - validation_date_in_subreddit_out/   # From [2017-11, ..., 2018-08] and 80 held-out subreddits
      - <subreddit>.txt
      ...
    - validation_date_out_subreddit_in/   # From [2018-09, 2018-10] and 920 training subreddits
      - <subreddit>.txt
      ...
    - validation_date_out_subreddit_out/  # From [2018-09, 2018-10] and 80 held-out subreddits
      - <subreddit>.txt
      ...
  - tasks.txt                             # All subreddits
  - tasks_train.txt                       # Subreddits in the `subreddit_in` subsets
  - tasks_held_out.txt                    # Subreddits in the `subreddit_out` subsets

Each dialogues/<set> directory contains one file per subreddit, named for the subreddit, e.g. dialogues/training/askreddit.txt.

Each dialogues file (e.g. dialogues/training/askreddit.txt) has one dialogue per line, encoded as stringified JSON with this schema:

{
    "id":      "...",  // md5 of the sequence of turn IDs comprising this dialogue
    "domain":  "...",  // subreddit name, lowercase
    "task_id": "...",  // first 8 chars of md5 of the lowercase subreddit name
    "bot_id":  "",     // empty string, not valid for reddit
    "user_id": "",     // empty string, not valid for reddit
    "turns": [
        "...",
        ...
    ]
}
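
As a concrete illustration, task_id can be recomputed from a subreddit name using only the standard library (a minimal sketch; "askreddit" is just an example input):

import hashlib

subreddit = "askreddit"  # example; corresponds to the "domain" field
# First 8 characters of the md5 hex digest of the lowercase subreddit name
task_id = hashlib.md5(subreddit.lower().encode("utf-8")).hexdigest()[:8]
print(task_id)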

Here's an example of reading the data in Python:

import io
import json
import zipfile

with zipfile.ZipFile('dstc8-reddit-corpus.zip', 'r') as myzip:
    with io.TextIOWrapper(myzip.open('dialogues/training/askreddit.txt'), encoding='utf-8') as f:
        for line in f:
            dlg = json.loads(line)
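
To walk an entire split rather than a single subreddit, you can filter the zip's name list. A sketch, again using only the standard library, that counts dialogues per subreddit file in the training split:

import io
import zipfile

with zipfile.ZipFile('dstc8-reddit-corpus.zip', 'r') as myzip:
    for name in myzip.namelist():
        if not name.startswith('dialogues/training/'):
            continue
        with io.TextIOWrapper(myzip.open(name), encoding='utf-8') as f:
            num_dialogues = sum(1 for _ in f)  # one dialogue per line
        print(name, num_dialogues)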

Troubleshooting

Testing

You may want to download and subsample a single submissions file and a single comments file from Pushshift to troubleshoot potential issues. Alternatively, you can reduce the date range by setting the manual_dates parameter in config.yaml, e.g.

manual_dates:
  - "2018-02"

Memory errors

If you hit your machine's memory limits, you may want to reduce the number of concurrently running tasks in your config.yaml, e.g.

max_concurrent_build: 6
max_concurrent_sample: 12

Dialogue construction and sampling are the most memory-intensive stages.

Why does it take so long to download the data?

Pushshift enforces a connection limit. In our experience, using more than 4 connections per IP risks having your connections terminated.

We default to 4 concurrent connections, but if this is still too many you can lower the value in config.yaml:

max_concurrent_downloads: 4

Too many open files

This shouldn't happen, but if you get IOError: [Errno 24] Too many open files, try raising the open-file limit to something over 1000 with ulimit -n 1000, or removing it entirely with ulimit -n unlimited (on Linux).
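
To check the limit your process actually sees, the standard resource module (Unix-only) reports it; a minimal sketch:

import resource

# The soft limit is enforced on the process; the hard limit is the ceiling
# an unprivileged process may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")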

I don't have enough disk space

Luigi is basically make for Python: to run a task it only requires that the outputs of that task's direct dependencies exist, not the outputs of earlier tasks. So once you've filtered all the submissions and comments and are building dialogues, you can delete the raw data if you wish.

The raw data takes up the most space (>144 GB) but also takes the longest to obtain, so delete it with caution.

Filtering and building the dialogues discards most of the data, so keeping only the dialogues* directories is safe.

If you just want the final dataset, you can use the --small option to delete raw and intermediate data as the dataset is generated, e.g.

python scripts/reddit.py generate --small

Windows

This hasn't been thoroughly tested on Windows, but its dependencies are pure Python and, as far as we know, all supported on Linux, macOS, and Windows.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


Known Issues

Download script does not work - missing RS_* files

The reddit.py script no longer works because the submission files (with filenames starting with RS) are no longer available on pushshift.io. It is therefore no longer possible to download and generate the data.

checksum error while generating reddit dataset

Hi,
I tried to download and generate the reddit dataset and followed "Setup and Generation" in the readme.md.

When I run "python scripts/reddit.py generate --small",
I get an error from this line:

raise RuntimeError(f"Checksums don't match for {'RC' if self.filetype == 'comments' else 'RS'}_{self.date}!")

Then I just deleted those two lines and got this error:

ERROR: [pid 52794] Worker Worker(salt=071717358, workers=36, host=storm, username=jglee, pid=45172) failed BuildDialogues(date=2017-11)
Traceback (most recent call last):
File "/home/jglee/anaconda3/envs/dstc8-baseline/lib/python3.7/site-packages/luigi-2.8.8-py3.7.egg/luigi/worker.py", line 184, in run
raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependency at run time: FilterRawSubmissions_2017_11_30e5e4658e

How can I solve this?
What am I doing wrong?

some error while generating the data

Hi,
I got an error while generating the data with the script (python scripts/reddit.py generate).

I followed the steps below.

Setup and Generation
Modify run_dir in configs/config.prod.yaml to where you want all your data to be generated.
Install the package with python setup.py install.
Generate the data with python scripts/reddit.py generate.

This is my error message.

missing_checksums={'RC_2018-06.xz': '01778d656253b5497769eeab36c8610a64b6f271fbe4065cdc21f5841faee530', 'RC_2018-07.xz': 'e703b5b95005d655283ae7149ee775a31534402fa801705fc771c32eea874781', 'RC_2018-08.xz': 'b8939ecd280b48459c929c532eda923f3a2514db026175ed953a7956744c6003', 'RC_2018-10.xz': 'cadb242a4b5f166071effdd9adbc1d7a78c978d3622bc01cd0f20d3a4c269bd0'}
delete_intermediate_data=False
ERROR: [pid 104057] Worker Worker(salt=761141828, workers=36, host=storm, username=jglee, pid=46329) failed DownloadRawFile(date=2018-05, filetype=comments)
Traceback (most recent call last):
File "/home/jglee/anaconda3/envs/reddit/lib/python3.6/site-packages/luigi-2.8.8-py3.6.egg/luigi/worker.py", line 199, in run
new_deps = self._run_get_new_deps()
File "/home/jglee/anaconda3/envs/reddit/lib/python3.6/site-packages/luigi-2.8.8-py3.6.egg/luigi/worker.py", line 141, in run_get_new_deps
task_gen = self.task.run()
File "/home/jglee/anaconda3/envs/reddit/lib/python3.6/site-packages/dstc8_reddit-0.1.0-py3.6.egg/dstc8_reddit/tasks/download.py", line 64, in run
raise RuntimeError(f"Checksums don't match for {'RC' if self.filetype == 'comments' else 'RS'}
{self.date}!")
RuntimeError: Checksums don't match for RC_2018-05!
ERROR: [pid 106791] Worker Worker(salt=761141828, workers=36, host=storm, username=jglee, pid=46329) failed ZipDataset()
Traceback (most recent call last):
File "/home/jglee/anaconda3/envs/reddit/lib/python3.6/site-packages/luigi-2.8.8-py3.6.egg/luigi/worker.py", line 199, in run
new_deps = self._run_get_new_deps()
File "/home/jglee/anaconda3/envs/reddit/lib/python3.6/site-packages/luigi-2.8.8-py3.6.egg/luigi/worker.py", line 141, in _run_get_new_deps
task_gen = self.task.run()
File "/home/jglee/anaconda3/envs/reddit/lib/python3.6/site-packages/dstc8_reddit-0.1.0-py3.6.egg/dstc8_reddit/tasks/packaging.py", line 138, in run
archive = ZipFile(zip_path, 'w', compression=ZIP_DEFLATED, compresslevel=9)
TypeError: __init__() got an unexpected keyword argument 'compresslevel'

===== Luigi Execution Summary =====

Scheduled 5085 tasks of which:
* 5084 ran successfully:
- 12 BuildDialogues(date=2017-11,2017-12,2018-01,2018-02,...)
- 24 DownloadRawFile(date=2017-11, filetype=comments) ...
- 12 FilterRawComments(date=2017-11,2017-12,2018-01,2018-02,...)
- 12 FilterRawSubmissions(date=2017-11,2017-12,2018-01,2018-02,...)
- 5000 MergeDialoguesOverDates(split=training, subreddit=1200isplenty) ...
...
* 1 failed:
- 1 ZipDataset()

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

And this is my conda Python package list.

(reddit) [jglee@storm dstc8-reddit-corpus]$ conda list
# packages in environment at /home/jglee/anaconda3/envs/reddit:
#
# Name                    Version                   Build  Channel
_libgcc_mutex 0.1 main
ca-certificates 2019.5.15 1
certifi 2019.6.16 py36_1
chardet 3.0.4 pypi_0 pypi
click 7.0 pypi_0 pypi
dataclasses 0.6 pypi_0 pypi
docutils 0.15.2 pypi_0 pypi
idna 2.8 pypi_0 pypi
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
lockfile 0.12.2 pypi_0 pypi
luigi 2.8.8 pypi_0 pypi
ncurses 6.1 he6710b0_1
numpy 1.17.0 pypi_0 pypi
openssl 1.0.2s h7b6447c_0
pip 19.1.1 py36_0
pydantic 0.32.1 pypi_0 pypi
python 3.6.5 hc3d631a_2
python-daemon 2.1.2 pypi_0 pypi
python-dateutil 2.8.0 pypi_0 pypi
python-rapidjson 0.8.0 pypi_0 pypi
readline 7.0 h7b6447c_5
requests 2.22.0 pypi_0 pypi
setuptools 41.0.1 py36_0
six 1.12.0 pypi_0 pypi
sqlite 3.29.0 h7b6447c_0
tk 8.6.8 hbc83047_0
tornado 4.5.3 pypi_0 pypi
urllib3 1.25.3 pypi_0 pypi
wheel 0.33.4 py36_0
xz 5.2.4 h14c3975_4
zlib 1.2.11 h7b6447c_3

Is there anyone who knows a solution?

(The TypeError above occurs because ZipFile's compresslevel keyword argument was added in Python 3.7, while the environment listed uses Python 3.6.5.)

Corpus can be generated only with Python 3.6+

Ubuntu 16.04, Python 3.5.2

Invoking setup.py fails with

  File "/tmp/easy_install-0119ery9/pydantic-0.29/setup.py", line 16
    self.links.add(f'.. _#{id}: https://github.com/samuelcolvin/pydantic/issues/{id}')
                                                                                    ^
SyntaxError: invalid syntax

due to f-strings. Maybe you need to fix the Requirements section.

Checksums don't match for RC_2018-09 (while generating)

Hi,
I got this error while generating the reddit dataset (python scripts/reddit.py generate):

ERROR: [pid 70798] Worker Worker(salt=701453871, workers=36, host=pixie, username=jglee, pid=68592) failed DownloadRawFile(date=2018-09, filetype=comments)
Traceback (most recent call last):
File "/home/jglee/anaconda3/envs/dstc8-baseline/lib/python3.7/site-packages/luigi-2.8.8-py3.7.egg/luigi/worker.py", line 199, in run
new_deps = self._run_get_new_deps()
File "/home/jglee/anaconda3/envs/dstc8-baseline/lib/python3.7/site-packages/luigi-2.8.8-py3.7.egg/luigi/worker.py", line 141, in run_get_new_deps
task_gen = self.task.run()
File "/home/jglee/anaconda3/envs/dstc8-baseline/lib/python3.7/site-packages/dstc8_reddit-0.1.0-py3.7.egg/dstc8_reddit/tasks/download.py", line 64, in run
raise RuntimeError(f"Checksums don't match for {'RC' if self.filetype == 'comments' else 'RS'}
{self.date}!")
RuntimeError: Checksums don't match for RC_2018-09!

How can I solve this?
