cynical-selection's Introduction

cynical-selection

Allo-media data selection tool

This code implements the data selection method and algorithms proposed in Axelrod's paper CYNICAL SELECTION OF LANGUAGE MODEL TRAINING DATA, based on the paper's explanations and on the Perl implementation Axelrod published on GitHub.

Comments in code and details on usage to come, but it's pretty simple right now.

Basic usage

Say you have a (small) representative corpus (task.txt) and a (big) general one (unadapted.txt), and you want to select the sentences from the big corpus that resemble those in the small corpus.

Usage would be:

./cynical-selection.py --task task.txt --unadapted unadapted.txt

This will produce a .jaded file containing the selected sentences using the following tab-separated format:

model score, sentence score (penalty + gain), length penalty, sentence gain, sentence id (in the selection), sentence id (in the unadapted corpus), best word, word gain, sentence.
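As a minimal sketch of reading such a file back, the parser below splits each line into the nine fields listed above. The field names, the numeric types (floats for scores and gains, integers for ids), and the absence of a header line are assumptions made for illustration; they are not taken from the script.

```python
from typing import Iterator, NamedTuple


class JadedRow(NamedTuple):
    # Hypothetical field names; the order follows the format described above.
    model_score: float
    sentence_score: float   # penalty + gain
    length_penalty: float
    sentence_gain: float
    selection_id: int       # sentence id in the selection
    unadapted_id: int       # sentence id in the unadapted corpus
    best_word: str
    word_gain: float
    sentence: str


def parse_jaded_line(line: str) -> JadedRow:
    """Split one tab-separated line into its nine fields."""
    # maxsplit=8 keeps any further tabs inside the sentence field.
    parts = line.rstrip("\n").split("\t", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(parts)}")
    return JadedRow(
        float(parts[0]), float(parts[1]), float(parts[2]), float(parts[3]),
        int(parts[4]), int(parts[5]), parts[6], float(parts[7]), parts[8],
    )


def read_jaded(path: str) -> Iterator[JadedRow]:
    """Yield one JadedRow per non-empty line of a .jaded file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_jaded_line(line)
```

For example, `[row.sentence for row in read_jaded(path)]` would recover just the selected sentences, assuming the types above match what the script writes.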

See the header of the script for the available options. Here are the two most important:

batch: essential with big corpora; allows selecting more than one sentence at a time (see Axelrod's paper)

iterate: repeat selection runs until no more than 10% of the original size can be removed
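The iterate behaviour can be sketched as a simple loop. This is one reading of the description above, not the script's actual code: it assumes each run returns a subset of its input, and that iteration stops once a run removes at most 10% of the original number of sentences. The names `iterate_selection` and `select_once` are hypothetical.

```python
from typing import Callable, List


def iterate_selection(
    sentences: List[str],
    select_once: Callable[[List[str]], List[str]],
    threshold: float = 0.10,
) -> List[str]:
    """Re-run a selection pass until a pass removes little enough.

    `select_once` stands in for a single selection run (hypothetical).
    """
    original_size = len(sentences)
    current = sentences
    while current:
        kept = select_once(current)
        removed = len(current) - len(kept)
        current = kept
        # Assumed stopping rule: this pass removed at most `threshold`
        # of the original corpus size.
        if removed <= threshold * original_size:
            break
    return current
```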

cynical-selection's People

Contributors

antho-rousseau

cynical-selection's Issues

"KeyError: '@@'" raised while processing large corpora

Hello! I've been trying to select data from a large corpus of 21M sentences, using a representative corpus of 100k sentences, and hit a "KeyError: '@@'" exception.
I ran the script with the following parameters:

./cynical-selection.py --task ../ds/100k.google --unadapted ../ds/10M.os.en --no-lower --batch

Full text of the exception:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "./cynical-selection.py", line 695, in main_loop
    unadapted_squish = squish_corpus(unadapted_data, replace)
  File "./cynical-selection.py", line 297, in squish_corpus
    squished.append(' '.join([replace[token] for token in line.split()]))
  File "./cynical-selection.py", line 297, in <listcomp>
    squished.append(' '.join([replace[token] for token in line.split()]))
KeyError: '@@'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./cynical-selection.py", line 819, in <module>
    main()
  File "./cynical-selection.py", line 798, in main
    selected = threading_wrapper(task_data, unadapted_data, args)
  File "./cynical-selection.py", line 775, in threading_wrapper
    zip(repeat(task_data), parts_list, repeat(args)))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 296, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
KeyError: '@@'

I also tried running the program with a 10M-sentence general corpus, but that didn't resolve the issue. It runs perfectly well on corpora with 2M sentences or fewer.

10M.os.en-100k.google-20190226_2152.log.zip
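The failing expression in the traceback looks up every token of a line in the `replace` mapping, so any token missing from that mapping raises a KeyError. The sketch below reproduces that behaviour on toy data and adds a small diagnostic. It assumes `replace` is a plain dict; `squish_tolerant` and `missing_tokens` are hypothetical helpers, and passing unknown tokens through unchanged is an assumption that may not suit the selection itself. It does not explain why '@@' is absent from the mapping in the first place.

```python
from typing import Dict, List, Set


def squish_strict(lines: List[str], replace: Dict[str, str]) -> List[str]:
    # Same lookup as in the traceback: raises KeyError on any unmapped token.
    return [" ".join([replace[token] for token in line.split()]) for line in lines]


def squish_tolerant(lines: List[str], replace: Dict[str, str]) -> List[str]:
    # Hypothetical variant: tokens missing from the mapping pass through unchanged.
    return [
        " ".join(replace.get(token, token) for token in line.split())
        for line in lines
    ]


def missing_tokens(lines: List[str], replace: Dict[str, str]) -> Set[str]:
    # Diagnostic: collect every token that the strict lookup would fail on.
    return {
        token
        for line in lines
        for token in line.split()
        if token not in replace
    }
```

Running `missing_tokens` on the part of the corpus handed to a failing worker would show whether '@@' is the only unmapped token or one of many.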
