farstail's Introduction

FarsTail: a Persian natural language inference dataset

Natural Language Inference (NLI), also called Textual Entailment, is an important NLP task whose goal is to determine the inference relationship between a premise p and a hypothesis h. It is a three-class problem in which each pair (p, h) is assigned to one of these classes: "ENTAILMENT" if the hypothesis can be inferred from the premise, "CONTRADICTION" if the hypothesis contradicts the premise, and "NEUTRAL" if neither of the above holds.
There are large NLI datasets for English, such as SNLI, MNLI, and SciTail, but few datasets exist for low-resource languages like Persian.
Persian (Farsi) is a pluricentric language spoken by around 110 million people in countries such as Iran, Afghanistan, and Tajikistan. Here, we present FarsTail, the first relatively large-scale Persian dataset for the NLI task. A total of 10,367 samples are generated from a collection of 3,539 multiple-choice questions. The train, validation, and test portions include 7,266, 1,537, and 1,564 instances, respectively. Please refer to the manuscript for more details.
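
As a quick sanity check on the figures above, the split sizes can be verified to add up to the stated total. This is only an illustrative sketch; the variable names are made up here and are not part of the dataset's code.

```python
# Split sizes as reported in the text above.
split_sizes = {"train": 7266, "validation": 1537, "test": 1564}

total = sum(split_sizes.values())
assert total == 10367  # matches the reported number of samples

# Share of each split, as a percentage of the whole dataset.
for name, size in split_sizes.items():
    print(f"{name}: {size} samples ({100 * size / total:.1f}%)")
```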

Reading data

To read the raw data in the Persian alphabet, use the following code:

import pandas as pd

# The files are tab-separated, hence sep='\t'.
train_data = pd.read_csv('data/Train-word.csv', sep='\t')
val_data = pd.read_csv('data/Val-word.csv', sep='\t')
test_data = pd.read_csv('data/Test-word.csv', sep='\t')

The train_data and val_data have three columns: premise, hypothesis, and label. The test_data has two more columns, denoted hard(hypothesis) and hard(overlap), which indicate whether each sample belongs to the hard subset according to the hypothesis-only and overlap-based biased models, respectively.
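
The hard-subset columns described above can be used to split the test set. The following is a minimal sketch on toy data, not on the real files: the column names are taken from the description above, while the label strings and the 0/1 encoding of the hard flags are assumptions made for illustration and should be checked against the actual CSV.

```python
import io

import pandas as pd

# Toy tab-separated data standing in for data/Test-word.csv.
# ASSUMPTIONS: the label strings and the 0/1 encoding of the hard flags
# are illustrative only and may differ in the real file.
toy_tsv = (
    "premise\thypothesis\tlabel\thard(hypothesis)\thard(overlap)\n"
    "p1\th1\tentailment\t0\t1\n"
    "p2\th2\tcontradiction\t1\t1\n"
    "p3\th3\tneutral\t0\t0\n"
)
test_data = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# Boolean masks marking membership in each hard subset.
hard_hyp = test_data["hard(hypothesis)"].astype(bool)
hard_ovl = test_data["hard(overlap)"].astype(bool)

hard_hypothesis_subset = test_data[hard_hyp]
easy_hypothesis_subset = test_data[~hard_hyp]
hard_overlap_subset = test_data[hard_ovl]

print(test_data["label"].value_counts())
print(len(hard_hypothesis_subset), len(easy_hypothesis_subset), len(hard_overlap_subset))
```

With the real data, replacing the toy string with the path to the test file should be the only change needed, provided the flag columns are encoded in a way that `astype(bool)` interprets correctly.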

Researchers who do not read Persian can use the following code to load the tokenized, indexed data:

import numpy as np

# allow_pickle=True is required because the archive stores Python objects.
with np.load('data/Indexed-FarsTail.npz', allow_pickle=True) as f:
    train_ind, val_ind, test_ind, dictionary = f['train_ind'], f['val_ind'], f['test_ind'], f['dictionary'].item()

The train_ind and val_ind are numpy arrays of shape (n, 3), where n is the number of samples in each set. Each entry in these arrays holds the tokenized, indexed version of the premise and hypothesis along with the respective label for one instance. The entries of test_ind have two more elements, corresponding to the hard(hypothesis) and hard(overlap) columns, respectively. The dictionary variable maps the indices to tokens.
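
To illustrate the layout described above, here is a hedged sketch that decodes indexed samples back into tokens. It uses toy stand-ins rather than the real archive, and it assumes that each premise and hypothesis is a list of integer indices and that the dictionary is keyed by those integers; the tokens, labels, and the `decode` helper are hypothetical.

```python
import numpy as np

# Toy stand-ins for the objects loaded from Indexed-FarsTail.npz.
# ASSUMPTIONS: ids are lists of ints, `dictionary` maps int -> token,
# and the tokens and label strings below are placeholders.
dictionary = {1: "the", 2: "cat", 3: "sleeps", 4: "animal", 5: "rests", 6: "runs"}

samples = [
    ([1, 2, 3], [1, 4, 5], "ENTAILMENT"),
    ([1, 2, 3], [1, 2, 6], "CONTRADICTION"),
]

# Build an object array of shape (n, 3), mirroring the described layout.
train_ind = np.empty((len(samples), 3), dtype=object)
for i, (premise_ids, hypothesis_ids, label) in enumerate(samples):
    train_ind[i, 0] = premise_ids
    train_ind[i, 1] = hypothesis_ids
    train_ind[i, 2] = label


def decode(ids, index_to_token):
    """Map a sequence of token indices back to a space-joined string."""
    return " ".join(index_to_token[i] for i in ids)


for premise_ids, hypothesis_ids, label in train_ind:
    print(decode(premise_ids, dictionary), "|", decode(hypothesis_ids, dictionary), "|", label)
```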

Results

Here are the test accuracies obtained by training several models on the FarsTail training set. Please refer to the manuscript for more results.

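| Model                | Test Accuracy | Hypothesis-only (Easy) | Hypothesis-only (Hard) | Overlap-based (Easy) | Overlap-based (Hard) |
|----------------------|---------------|------------------------|------------------------|----------------------|----------------------|
| DecompAtt (word2vec) | 0.6662        | 0.7341                 | 0.5823                 | 0.7633               | 0.5404               |
| HBMP (word2vec)      | 0.6604        | 0.7618                 | 0.5350                 | 0.7565               | 0.5360               |
| ESIM (fastText)      | 0.7116        | 0.7931                 | 0.6109                 | 0.8120               | 0.5815               |
| mBERT                | 0.8338        | 0.8763                 | 0.7811                 | 0.8981               | 0.7504               |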
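
The easy and hard columns report accuracy restricted to the corresponding test subsets. A minimal sketch of how such per-subset accuracies could be computed from a model's predictions is given below; the `subset_accuracies` helper, the toy labels, and the 0/1 hard flags are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def subset_accuracies(y_true, y_pred, hard_mask):
    """Return accuracy on all samples, on the easy subset, and on the hard subset."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    hard_mask = np.asarray(hard_mask, dtype=bool)
    correct = y_true == y_pred  # elementwise comparison
    return {
        "all": float(correct.mean()),
        "easy": float(correct[~hard_mask].mean()),
        "hard": float(correct[hard_mask].mean()),
    }


# Toy gold labels, predictions, and hard flags (all hypothetical).
y_true = ["e", "c", "n", "e", "n", "c"]
y_pred = ["e", "c", "e", "e", "n", "n"]
hard = [0, 0, 1, 0, 1, 1]

scores = subset_accuracies(y_true, y_pred, hard)
print(scores)
```

Calling the helper once with the hard(hypothesis) flags and once with the hard(overlap) flags would give the two pairs of easy/hard columns.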

Reference

If you use this dataset, please cite the following paper:

Hossein Amirkhani, Mohammad AzariJafari, Soroush Faridan-Jahromi, Zeinab Kouhkan, Zohreh Pourjafari, Azadeh Amirak (2023). FarsTail: a Persian natural language inference dataset. Soft Computing.

@article{amirkhani2023farstail,
  title={FarsTail: a Persian natural language inference dataset},
  author={Amirkhani, Hossein and AzariJafari, Mohammad and Faridan-Jahromi, Soroush and Kouhkan, Zeinab and Pourjafari, Zohreh and Amirak, Azadeh},
  journal={Soft Computing},
  year={2023},
  publisher={Springer},
  doi={10.1007/s00500-023-08959-3}
}

farstail's Issues

LICENSE

@h-amirkhani @m-azari nice work! 👍

We're interested in using your data, and hence, we want to check in with you about its license.
Could you please add the license terms to your github repository?
A more permissive license that allows us to modify+redistribute your dataset (e.g., Apache 2.0) would work well for us.

FYI @armancohan
