Data Quran

This repository is a collection of free datasets for everything related to the Quran: the text, translations, word-by-word data, and tafsir.

Goals
Data Format
Repository Structure
Contributions
Legality
License

Goals

This repository was created for several reasons:

To centralize all Quran datasets in one place.

Currently, to create a Quran app, developers need to gather data from various sources by downloading files, calling APIs, or scraping manually. It would be nice to have a single repository that gathers them all.

To standardize the dataset format.

Each data source usually has its own format, which means developers need to parse and normalize each one. It would be nice if all of these datasets used a single format.

To archive datasets in case the original source goes down or becomes unreachable.

Several useful Quranic websites went down after being inactive for several years. There are also cases where governments decided to ban Quranic apps from app stores. Hopefully this repository can serve as an archive so that useful data doesn't vanish even after the original websites are gone.

To give proper attribution and explain how the data was collected.

Several other repositories also collect Quranic data. However, as far as I know, none of them really mention their sources or how the data was collected.

Data Format

Criteria

When choosing the formats for this repository, several criteria had to be fulfilled:

It must be usable across programming languages.
It must be platform agnostic and must not require a specific app to use.
It must support multi-line text.
It must support rich-text formatting and footnotes.
It must be easy to read and write, even for non-programmers.
It must be usable with Git, and the diff must be easy to read.

Chosen Formats

Two formats are used in this repository:

JSON
Markdown

JSON is a data format supported across virtually all programming languages. It's used for all Quranic data whose values are short, i.e. Quran metadata, surah data, and word-by-word translations. It was chosen for the following reasons:

Virtually every programming language supports JSON, and most of them include a JSON encoder and decoder in their standard library.
It can be opened and edited by every text editor on every operating system.
Properly formatted JSON files are easy to read and write, even for non-programmers.
Since it's just a text file, it's trackable using Git.

The only downside of JSON is that we can't easily put multi-line or rich-text content in a JSON value. While it's doable, the result is not easy to read, and non-programmers usually don't know how to write it. This is why we only use JSON for Quranic data with short values.
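To illustrate that downside, here is a minimal Python sketch of what happens when multi-line text is stored as a JSON value. The sample text and the key "1" are made up for this example and are not taken from the repository.

```python
import json

# Hypothetical two-line translation text with a footnote marker;
# not taken from the repository.
text = "First line of a long translation.\nSecond line, with a footnote marker [1]."

# Serialize it as a JSON value. The line break inside the text is written
# as the two-character escape sequence \n, so the value stays on one line.
encoded = json.dumps({"1": text}, ensure_ascii=False, indent=2)
print(encoded)

# Decoding restores the original text, so no data is lost.
# The cost is only in how readable the raw file is.
decoded = json.loads(encoded)
```

The file still round-trips correctly, but anyone editing it by hand has to read and type the escape sequences, which is what the criteria above try to avoid.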

Markdown is used for all Quranic data whose values are long or multi-line strings. This includes the Quran text, translations, transliterations, and tafsir. It was chosen for the following reasons:

Most programming languages have third-party libraries for parsing and rendering Markdown.
It can be opened and edited by every text editor on every operating system.
It supports multi-line text.
It supports rich-text formatting, and there is an extension that adds support for footnotes.
It's easy to read and write.
It's also a text file, so it's trackable using Git.

The markdown files in this repository are formatted like this:

<!--
Comment block for license or metadata
-->

# [verse-id-1]

The content for this verse.

# [verse-id-2]

The content for this verse.
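
Because every verse starts with a level-one heading, a file in this layout can be split with a few lines of code and no Markdown library. The Python sketch below is one possible way to do it, not the repository's own tooling. The function name `parse_verses` and the verse IDs `1` and `2` are hypothetical, and it assumes that no line inside a verse's content starts with `# `.

```python
import re


def parse_verses(markdown: str) -> dict[str, str]:
    """Split a verse file into a {verse_id: content} mapping."""
    # Drop HTML comment blocks, which hold the license or metadata.
    body = re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)

    verses: dict[str, str] = {}
    current_id = None
    lines: list[str] = []
    for line in body.splitlines():
        if line.startswith("# "):
            # A new heading closes the previous verse, if there was one.
            if current_id is not None:
                verses[current_id] = "\n".join(lines).strip()
            current_id = line[2:].strip()
            lines = []
        elif current_id is not None:
            lines.append(line)
    # Store the last verse in the file.
    if current_id is not None:
        verses[current_id] = "\n".join(lines).strip()
    return verses


sample = """<!--
Comment block for license or metadata
-->

# 1

The content for this verse.

# 2

A second verse
spanning two lines.
"""

verses = parse_verses(sample)
print(verses)
```

Multi-line content is kept as is, so any rich-text formatting or footnotes inside a verse can be passed to a Markdown library afterwards.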

Rejected Formats

Three formats were considered but eventually rejected:

Plain text
CSV
XML

Plain text was considered because it's used in the Tanzil dataset. In this format, each verse uses only one line, which makes it compact and easy to read and write.

Pros:

It can be opened by every text editor.
It's easy enough to read and write.
It's trackable with Git.

Cons:

Since each verse uses only one line, multi-line text is not supported.
It doesn't support rich-text formatting or footnotes.
We could force it to support multi-line and rich-text content by using HTML tags like <br>, <u>, and <b>. However, doing so makes it hard for non-programmers to read and write, which removes the advantages of this format.

The second candidate was CSV, which was considered because it's used in the QuranEnc dataset.

Pros:

It can be opened by every text editor and spreadsheet program.
It's easy enough to read and write, especially when edited in a spreadsheet program.
It supports multi-line text.
It's a text file, which makes it trackable with Git.

Cons:

The default CSV symbols (i.e. the separator and the quote character) differ depending on the user's locale. This can lead to problems when editing the file.
It doesn't support rich-text formatting or footnotes.
We could force it to support rich-text content by using HTML tags. However, doing so makes it hard for non-programmers to read and write.
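
The locale problem can be sketched in Python. The row below is made up for this example: it imitates a file saved by a spreadsheet program that uses a semicolon as the separator, then read back by a parser that expects a comma.

```python
import csv
import io

# Hypothetical row, as a spreadsheet program in a semicolon-separator
# locale might save it: verse ID, then verse content.
saved = "1;The content for this verse.\n"

# Reading with the default comma separator yields a single merged field.
wrong = next(csv.reader(io.StringIO(saved)))

# Reading with the matching separator yields the two intended fields.
right = next(csv.reader(io.StringIO(saved), delimiter=";"))

print(wrong)
print(right)
```

Nothing in the file itself says which separator was used, so every reader and editor of the file has to agree on it in advance.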

The last format was XML, but it was immediately rejected because it's hard for non-programmers to read and write.

Repository Structure

This repository is composed of several directories:

meta contains the metadata used in the Quran.
surah contains the Arabic name, data, and ayah range for each surah in the Quran.
surah-info contains descriptions and additional info for each surah in the Quran.
surah-translation contains translations of the Arabic name of each surah.
ayah-text contains the Arabic text used in the Quran.
ayah-tafsir contains additional explanation for each verse in the Quran.
ayah-translation contains the translations for each verse in the Quran.
ayah-transliteration contains transliterations from Arabic to Latin script for each verse. Useful for those starting to learn how to read the Quran.
word contains the ID and position of each word in the Quran.
word-text contains the Arabic text for each word in the Quran.
word-translation contains the translations for each word in the Quran.
word-transliteration contains transliterations from Arabic to Latin script for each word.
source contains explanations of where and how the data in this repository was collected.
cli contains the Go application used to download and generate data for this repository.

Contributions

Like other open source projects, we are open to suggestions and corrections. Feel free to submit an issue if there are any errors in the dataset. However, there is a special rule for pull requests.

In this repository, all data is scraped using cli from the sources' official pages or APIs. This is done to make sure the data in this repository is the same as the data in the original source. Therefore, any PR that modifies the data will be rejected.

If you find a problem with the data, you should contact the original source and ask them to correct it. Once they make the correction upstream, we will update the data in this repository.

Legality

Data in this repository is collected from various sources, either by using the official download links, accessing their APIs, or scraping their web pages. Since scraping public information is considered legal in most countries, we hope this collection can be considered fair use. Besides that, some sources also have their own terms of use, which we try to fulfill.

License

This repository is available under the CC BY-NC-ND 4.0 license. This means you can use this repository for free under the following terms:

Attribution. You must give appropriate credit to this repository and provide a link to the license. Check out the Creative Commons guide on how to give attribution.

If possible, please also include the original sources in your attribution. For example:

Data is taken from the data-quran repository, which is licensed under CC BY-NC-ND 4.0 and collected by the Hablullah team from various sources, e.g. Tanzil, QuranEnc, etc.

Non commercial. You may not use data from this repository for commercial purposes. This includes one-time purchases, subscriptions, in-app purchases, and in-app advertising. However, you are allowed to ask for donations for your apps, as long as they are not mandatory.

No derivatives. You are not allowed to publish derivative works from this repository. Derivative here means any modification, including translations, revisions, annotations, elaborations, or any other changes based on this repository.

If you have any modifications or revisions, you must submit them as a pull request to this repository. This is done to make sure this repository stays the single source of truth (SSOT) and to prevent confusion between multiple forks.

However, you are allowed to change the data format to make it suitable for your applications. So, even though this repository publishes data in JSON and Markdown formats, you can safely convert it to SQL format. For more details, check out section 2.a.4 on the license page and this FAQ from Creative Commons.
