darija-open-dataset / dataset

I've been looking through sentence.csv and found typos in some sentences, such as "8" used in place of "h" or "7". Some sentences can also have more than one valid transliteration, for example "kaychrab kass dyal lma" and "kaychrb kass dlma".
In addition, "x" is sometimes used instead of "ch".
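A minimal sketch of how such rows could be flagged for manual review, assuming the sentences are available as plain strings. It only reports candidates and does not rewrite anything, since whether "8" should become "h" or "7" needs a human decision. The function name and sample strings are hypothetical and not taken from the dataset.

```python
import re

# Characters the issue reports as likely typos in the Latin-script text:
# "8" (where "h" or "7" was probably meant) and "x" (where "ch" was probably meant).
SUSPICIOUS = re.compile(r"[8x]", re.IGNORECASE)

def flag_suspicious(sentences):
    """Return (index, sentence) pairs that contain a suspicious character."""
    return [(i, s) for i, s in enumerate(sentences) if SUSPICIOUS.search(s)]

# Hypothetical sample rows, made up for illustration.
sample = ["kaychrab kass dyal lma", "kayxrab kass dyal lma", "8ada mzyan"]
for index, sentence in flag_suspicious(sample):
    print(index, sentence)
```

Reviewing the flagged rows by hand keeps legitimate uses of these characters intact.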

Translating 17.csv

Translating and fixing typos in 17.csv.

Translating 37.csv

Translating and fixing typos in 37.csv.

Translating 21.csv

Translating and fixing typos in 21.csv.

How to know if a word or a sentence is already added

I want to start contributing to DODA and have been exploring the repo. How do you identify whether a sentence or a word has already been added to the dataset?
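One possible way to check this locally, sketched with Python's standard csv module. This is not the maintainers' documented process. The column names "darija" and "eng" and the sample row are assumptions made for illustration, so adjust them to the actual file headers.

```python
import csv
import io

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def load_known(csv_file, column):
    """Collect the normalized values of one column from an open CSV file."""
    return {normalize(row[column]) for row in csv.DictReader(csv_file) if row.get(column)}

def is_already_added(sentence, known):
    """Check a candidate sentence against the set of known entries."""
    return normalize(sentence) in known

# Hypothetical in-memory file; the column names "darija" and "eng" are assumptions.
sample = io.StringIO("darija,eng\nkaychrab kass dyal lma,he drinks a glass of water\n")
known = load_known(sample, "darija")
print(is_already_added("Kaychrab  kass dyal lma", known))  # True
print(is_already_added("kaychrb kass dlma", known))        # False
```

Exact matching after normalization does not catch spelling variants of the same sentence, so a match of False still deserves a quick manual search.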

Translating 30.csv

Translating and fixing typos in 30.csv.

Translating 23.csv

Translating and fixing typos in 23.csv.

Translating 1.csv

Translating and fixing typos in 01.csv.

sentences2.csv is not working

I want to load this dataset into a DataFrame and I get this error

ParserError Traceback (most recent call last)
in
1 import pandas as pd
2
----> 3 print(pd.read_csv("dataset/sentences2.csv"))

~\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 1865, saw 4

And when I set error_bad_lines=False, this is what I get
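The ParserError reports that line 1865 has 4 fields where 2 were expected, which often points to unquoted commas inside a sentence. A minimal sketch for locating such rows with Python's standard csv module, so they can be fixed in the file rather than skipped. The function name and the sample content are hypothetical.

```python
import csv
import io

def find_bad_rows(csv_file, expected_fields=2):
    """Return (line_number, field_count) for rows whose field count is unexpected."""
    reader = csv.reader(csv_file)
    bad = []
    for row in reader:
        # Blank lines come back as empty lists and are ignored here.
        if row and len(row) != expected_fields:
            # reader.line_num is the number of physical lines read so far.
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical content standing in for dataset/sentences2.csv.
sample = io.StringIO("darija,eng\na,b\nc,d,e,f\n")
print(find_bad_rows(sample))  # [(3, 4)]
```

To run it on the real file, open it with `open("dataset/sentences2.csv", newline="", encoding="utf-8")` (the encoding is an assumption) and pass the file object in. Separately, pandas 1.3 and later accept `on_bad_lines="skip"` in `read_csv` in place of `error_bad_lines=False`, though skipping silently drops those sentences.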

Translating 31.csv

Translating and fixing typos in 31.csv.

Translating 3.csv

Translating and fixing typos in 03.csv.

Translating 20.csv

Translating and fixing typos in 20.csv.

Translating 41.csv

Translating and fixing typos in 41.csv.

Translating 34.csv

Translating and fixing typos in 34.csv.

Translating 28.csv

Translating and fixing typos in 28.csv.

Translating 42.csv

Translating and fixing typos in 42.csv.

Translating 9.csv

Translating and fixing typos in 09.csv.

Translating 6.csv

Translating and fixing typos in 06.csv.

Translating 35.csv

Translating and fixing typos in 35.csv.

Hosting the dataset on DataStack?

Hello!

I enjoy seeing community data projects like this. As a former student of Cantonese (a dialect of Chinese spoken in Hong Kong), I know what it's like when the language you're interested in has no dictionary. At one point, I tried to create one myself, but gave up after 500 words. It's too much effort to do alone.

The issue today is that collaborating on these things is very technical. I'm sure your contributors here are familiar with GitHub and have no problem helping out, but it would be great if anyone who knows English and Darija could help out, even without technical skills.

For this reason I am building DataStack. It's a collaboration platform for tabular data that works similarly to GitHub but is much easier to use for data. No tools or technical knowledge needed.

To show you what it looks like, I've uploaded two files from your dataset there:

https://datastack.net/boukeversteegh/darija-open-dataset-demo/

If you're interested, please have a look. Does this solution suit your needs? What is still missing in your opinion?

