c-3lab / dim
📦 dim: Manage the open data in your project like a package manager.
License: MIT License
Thank you for creating this project :)
A data installation manager is absolutely required for the open source community. I faced some difficulties when developing a dataset and a data analysis tool with Python regarding COVID-19.
Is it possible to add a Python (+R?) library to simplify the interaction with dim?
(I'm not sure we can call Deno from Python...)
Users may use the new library as follows.
poetry add dim-python
[tool.dim]
directory = './data_files'

# One [[tool.dim.datasets]] table per dataset (TOML inline tables cannot span lines).
[[tool.dim.datasets]]
name = 'example'
url = 'https://example.com'
unzip = true
forced = true
encoding = 'utf-8'
postprocess = ["poetry run python ./tests/test_custom_command.py"]
Update datasets with poetry run dim update, or update/load the dataset with Python scripts.
import dim
dim.config(settings='./data_files/dim-lock.json')
data = dim.load(name='example')
I'm just a new user, but very interested in this project.
Delete only data_files.
To delete all, rm and init.
Install the data from the other dim.json
case1
dim install -f ./file/path/dim.json
case2
dim install -f https://github/xxx/xxxx/master/xxxx/dim.json
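For example, is it possible to specify two or more post-processing steps, such as first unzipping the downloaded file and then converting it from sjis to utf-8?
If this is not currently possible, I would like to propose it as a new feature and create a PR.
I felt the need for this when I wanted to download the Natural Disaster Monument data from the Geospatial Information Authority of Japan (国土地理院).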
dim update simply updates the data file. For postProcesses and headers, the policy changes so that the previously specified contents are reused.
Lines 33 to 36 in c546fdc
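Regarding this part:
When an error occurs, a message such as "Request failed with status code 404 Not Found"
is displayed, but this is ky's error message, and the error message from the OpenAI response (see the example below) does not appear to be displayed.
Command example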
curl -i https://api.openai.com/v1/completions -H \
"Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-3.5-turbo",
"prompt": "generate the code that prints the following messange in python: this is a test",
"max_tokens": 7,
"temperature": 0
}'
Output
HTTP/2 404
date: Thu, 15 Jun 2023 12:13:02 GMT
content-type: application/json
content-length: 227
access-control-allow-origin: *
openai-organization: albert-inc-1
openai-processing-ms: 268
openai-version: 2020-10-01
strict-transport-security: max-age=15724800; includeSubDomains
x-ratelimit-limit-requests: 3500
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-requests: 3499
x-ratelimit-remaining-tokens: 89992
x-ratelimit-reset-requests: 17ms
x-ratelimit-reset-tokens: 4ms
x-request-id: 9bdd4ddbcedcce031c323eb331076f87
cf-cache-status: DYNAMIC
server: cloudflare
cf-ray: 7d7ab9887db0afdb-NRT
alt-svc: h3=":443"; ma=86400
{
"error": {
"message": "This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?",
"type": "invalid_request_error",
"param": "model",
"code": null
}
}
I would like the "message": "This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?" part of this response to be displayed.
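A minimal sketch of how the message could be pulled out of the response body, assuming the caller can obtain the failed response's body text (ky exposes the response on its HTTPError, as far as I know). The helper name extractApiErrorMessage is hypothetical and not part of dim or ky.

```typescript
// Hypothetical helper: not part of dim or ky.
// Returns `error.message` from an OpenAI-style JSON error body,
// or the fallback text when the body has a different shape or is not JSON.
function extractApiErrorMessage(body: string, fallback: string): string {
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed === "object" && parsed !== null && "error" in parsed) {
      const error = (parsed as { error: unknown }).error;
      if (typeof error === "object" && error !== null && "message" in error) {
        const message = (error as { message: unknown }).message;
        if (typeof message === "string" && message.length > 0) {
          return message;
        }
      }
    }
  } catch {
    // The body was not valid JSON; fall through to the fallback.
  }
  return fallback;
}
```

The fallback argument would be the existing generic text such as "Request failed with status code 404 Not Found", so nothing is lost when the API returns an unexpected body.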
dim.json
{
"fileVersion": "1.1",
"contents": [{
"name": "xxxxxxx", // name specified at install time; the URL if no name was specified
"url": "https://xxxx.xxx.xx", // url specified at install time
"catalogUrl": "https://ckan.xxx.xx", // package catalog URL when obtained via search -i; otherwise null
"catalogResourceId": "123456abcd", // resource id when obtained via search -i; otherwise null
"postProcesses": [
{ "type": "unzip", "arguments": { "password": "dummy", ... } },
"csv_to_json"
], // post_process specified at install time; string or Object
"headers": { "Fiware-Service": "service1" }, // headers specified at install time, in key:value form
}]
}
dim-lock.json
{
"lockFileVersion": "1.1",
"contents": [{
"name": "xxxxxxx", // name specified at install time; the URL if no name was specified
"url": "https://xxxx.xxx.xx", // url specified at install time
"path": "xxx/xxx/xx.json", // location where the file was saved at install time
"catalogUrl": "https://ckan.xxx.xx", // package catalog URL when obtained via search -i; otherwise null
"catalogResourceId": "123456abc", // resource id when obtained via search -i; otherwise null
"lastModified": "2022-07-06T02:28:06.556Z", // taken from last_modified in the response header of the fetched data; ISO 8601 format; null if unavailable
"eTag": "xxx-xxxxx", // taken from e-tag in the response header of the fetched data; null if unavailable; used for checking whether the provided data has changed
"lastDownloaded": "2022-07-06T02:28:06.556Z", // time the download was performed; renamed from the former last_updated; ISO 8601 format
"integrity": "sha1-xxxxxxxx", // modeled on npm's integrity; hash (sha1) of the file as downloaded; used for checking whether the file was changed after download
"postProcesses": [
{ "type": "unzip", "arguments": { "password": "dummy", ... } },
"csv_to_json"
], // post_process specified at install time; string or Object
"headers": { "Fiware-Service": "service1" }, // headers specified at install time, in key:value form
}]
}
$ dim install -h "Fiware-Service: xxxx" http://xxxxxxxx
Currently, the downloaded file is created at data_files/{name}/{filename}.
If the URL matches the following pattern, the current logic cannot determine a filename and an error occurs.
$ dim install https://www.example.com -n example1
Failed to install. Is a directory (os error 21), open './data_files/example1/'
Use the Content-Disposition response header to determine the filename, and fall back to the --name option as the filename if the server does not send that header.
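The proposed fallback could look roughly like the sketch below. The function name resolveFileName is hypothetical and this is not dim's actual implementation. It only handles the plain filename= parameter and skips the extended filename*= form.

```typescript
// Hypothetical helper, not dim's actual implementation.
// Picks a file name from a Content-Disposition header value,
// falling back to the value of the --name option.
function resolveFileName(contentDisposition: string | null, name: string): string {
  if (contentDisposition !== null) {
    // Matches filename="data.csv" or filename=data.csv (not filename*=...).
    const match = /filename\s*=\s*(?:"([^"]+)"|([^;\s]+))/i.exec(contentDisposition);
    const fileName = match?.[1] ?? match?.[2];
    if (fileName !== undefined) {
      // Keep only the last path segment so the header cannot point outside data_files.
      return fileName.split(/[\\/]/).pop() || name;
    }
  }
  return name;
}
```

With this, a request to https://www.example.com with -n example1 and no Content-Disposition header would be saved under the name example1 instead of failing on an empty filename.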
The version of Deno was updated to 1.26.0 on 2022.09.28.
Subsequently, the following error occurred in the Check type of CI.
error: TS1477 [ERROR]: An instantiation expression cannot be followed by a property access.
() => Promise<number>.resolve(4),
~~~~~~~~
at file:///home/runner/work/dim/dim/tests/libs/actions.search.test.ts:451:24
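One way the failing expression could be rewritten is sketched below. The variable name resolveFour is hypothetical, and the actual test may need a different shape. The idea is to move the type argument from Promise itself onto the resolve call.

```typescript
// Rejected with TS1477: `Promise<number>` is parsed as an instantiation
// expression, and a property access may not follow it.
//   () => Promise<number>.resolve(4)

// Accepted: the type argument is given to the method call instead.
const resolveFour = (): Promise<number> => Promise.resolve<number>(4);
```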
If the install command is executed with unzip in the -p option, the unzipped file is generated in the current directory, not in data_files as it is with xlsx-to-csv.
Change it so that the file is generated in the same directory as the file before conversion, as with xlsx-to-csv.
Use as a reference.
https://docs.npmjs.com/cli/v8/configuring-npm/npmrc
TS1192 now occurs while installing dim. (Reproducible)
error: TS1192 [ERROR]: Module '"https://jspm.dev/xlsx.js"' has no default export.
import xlsxlib from 'https://jspm.dev/xlsx'
~~~~~~~
at https://deno.land/x/[email protected]/src/xlsx.ts:1:8
# deno --version
deno 1.19.1 (release, x86_64-unknown-linux-gnu)
v8 9.9.115.7
typescript 4.5.2
# git for-each-ref
5383d922b715002dc8706fb6af8e4a53b125b8bd commit refs/heads/main
5383d922b715002dc8706fb6af8e4a53b125b8bd commit refs/remotes/origin/HEAD
5383d922b715002dc8706fb6af8e4a53b125b8bd commit refs/remotes/origin/main
8c54c493a4588103a9f715812e6f7dad467a9853 commit refs/tags/v0.1.3
5f4309e2f10008ad01582ccbc92ec7327a858df8 commit refs/tags/v0.1.4
b3092d7d4a9208437c60cae05de43a747503fbea commit refs/tags/v0.1.5
I reproduced it with the following steps in a fresh ubuntu container.
$ sudo docker run -it --rm ubuntu /bin/bash
The following operations are performed inside the container.
# apt update; apt upgrade
# apt install git curl unzip
# curl -fsSL https://deno.land/install.sh | sh
# echo 'export DENO_INSTALL="/root/.deno"' >> ~/.bashrc
# echo 'export PATH="$DENO_INSTALL/bin:$PATH"' >> ~/.bashrc
# source ~/.bashrc
# git clone [email protected]:ryo-ma/dim.git
# cd dim
# deno install --unstable --allow-read --allow-write --allow-run --allow-net dim.ts
Download https://cdn.skypack.dev/encoding-japanese
Download https://deno.land/[email protected]/fmt/colors.ts
Download https://deno.land/[email protected]/fs/mod.ts
...
Download https://deno.land/x/[email protected]/src/xlsx-types.ts
Download https://jspm.dev/xlsx
...
Check file:///root/dim/dim.ts
error: TS1192 [ERROR]: Module '"https://jspm.dev/xlsx.js"' has no default export.
import xlsxlib from 'https://jspm.dev/xlsx'
~~~~~~~
at https://deno.land/x/[email protected]/src/xlsx.ts:1:8
Environment variable
DIM_FILE_PATH=./file_path
CLI Option(install, uninstall, update, list, search)
dim install https://xxxxxx -n example --prefix ./file_path
$ dim search -i xxxx
package_title1
- package_url
- package_description
- package_license
1.resource_name1
* resource_url1
* resource_description1
* created1
* format
2.resource_name2
* resource_url2
* resource_description2
* created2
* format
package_title2
- package_url
- package_description
- package_license
3.resource_name3
* resource_url3
* resource_description3
* created3
* format
4.resource_name4
* resource_url4
* resource_description4
* created4
* format
...
Enter the number of the data to install
> 1
Enter the name. Enter blank if not required.
>
Enter the post-processing you want to add separated by spaces.
Enter blank if not required.
(ex.: > unzip xlsx-to-csv)
> unzip
installing...
unzip
Installed to /xxx/xxx
I tried to replace the OS information with the following code,
but it fails because it is readonly.
const denoBuildOsStub = stub(Deno.build, "os");
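A possible workaround, sketched under the assumption that the production code can be changed to read the OS through a small wrapper: tests then replace the wrapper's method instead of the read-only Deno.build.os. The names osInfo and DenoLike are hypothetical.

```typescript
// Hypothetical wrapper: production code calls osInfo.get() instead of
// reading Deno.build.os directly, so tests can replace the method.
type DenoLike = { build?: { os?: string } };

const osInfo = {
  get(): string {
    // Read Deno.build.os when running on Deno; "unknown" elsewhere.
    const deno = (globalThis as unknown as { Deno?: DenoLike }).Deno;
    return deno?.build?.os ?? "unknown";
  },
};
```

In a test, something like stub(osInfo, "get", () => "darwin") from the std mock module should then work, because get is an ordinary writable method rather than a read-only property.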
dim search "keyword" -n 10 --type json
If older versions of dim.json and dim-lock.json exist, compare the version numbers, and if differences are found, an error is generated.
It is necessary to check for fileVersion / lockFileVersion in constructor.
Similar to the following issue.
#54
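A minimal sketch of such a check, assuming the supported version is "1.1" as in the examples in this tracker. The constant and function names are hypothetical, and dim's real classes may be structured differently.

```typescript
// Hypothetical names; dim's real constructor may differ.
const SUPPORTED_FILE_VERSION = "1.1";

// Throws when the version recorded in dim.json or dim-lock.json
// differs from the version this build of the tool supports.
function assertSupportedVersion(fileName: string, found: string | undefined): void {
  if (found !== SUPPORTED_FILE_VERSION) {
    throw new Error(
      `${fileName}: expected version ${SUPPORTED_FILE_VERSION}, found ${found ?? "none"}`,
    );
  }
}
```

The constructor would call it once with fileVersion for dim.json and once with lockFileVersion for dim-lock.json.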
When name is specified and update is performed on a single data, the presence or absence of the -A option has no effect on the operation.
The operation of the following two commands is identical.
dim update name
dim update name -A
Disallow the name and -A option to be specified at the same time.
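The proposed restriction could be expressed as a small validation step before the update runs. This is only a sketch; the function name and the error text are hypothetical.

```typescript
// Hypothetical validation: returns an error message when both a name
// and the -A option are given, or null when the combination is allowed.
function validateUpdateArgs(name: string | undefined, optionA: boolean): string | null {
  if (name !== undefined && optionA) {
    return "The name argument and the -A option cannot be specified at the same time.";
  }
  return null;
}
```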
">" is not recognized as a redirect sign.
Deno.run probably treats ">" as a plain string argument.
A single code can be used in a variety of environments.
Have to handle complex file names and redirects yourself. (>>, 2>>, 1>&2, etc.)
If -p "cmd wc -c > /tmp/test.txt" is specified, start using /bin/sh as follows.
Deno.run({ cmd: ["/bin/sh", "-c", "wc -c data_files/xxx/xxx.zip > /tmp/test.txt"]})
/bin/sh handles redirects, so no need to implement your own processing.
If the function to send downloaded files as standard input is implemented, the string received with the -p option can be used as is.
Deno.run({ cmd: ["/bin/sh", "-c", "wc -c > /tmp/test.txt"], stdin: xxxx })
Need to change commands for each environment. (/bin/sh for Linux and Mac, cmd for Windows)
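The per-environment switch could be isolated in one function, as in the sketch below. The function name is hypothetical, and the os argument is assumed to be the value of Deno.build.os ("windows" on Windows).

```typescript
// Hypothetical helper: builds the argument list to hand to the process API
// so that the shell, not dim, interprets redirects such as ">" and "2>>".
function buildShellCommand(os: string, script: string): string[] {
  return os === "windows"
    ? ["cmd", "/c", script] // Windows command interpreter
    : ["/bin/sh", "-c", script]; // Linux and Mac
}
```

For example, buildShellCommand("linux", "wc -c data_files/xxx/xxx.zip > /tmp/test.txt") yields the same array as the cmd value shown above.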
Check for corruption under data_files using integrity in dim-lock.json.
SHA-512 is 128 characters in hexadecimal notation, so it is a little difficult to see.
If you are using it for checking corruption rather than for security, consider using a shorter notation such as SHA-1.
Since this is not a file that many people will see, using SHA-512 may not be too much of a problem.
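To make the length difference concrete, the sketch below computes both digests in hexadecimal. It assumes the node:crypto module is available in the runtime (recent Deno versions support node: specifiers). The helper name hexDigest is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex digest of a string with the given algorithm.
function hexDigest(algorithm: "sha1" | "sha512", data: string): string {
  return createHash(algorithm).update(data).digest("hex");
}

const sha1Hex = hexDigest("sha1", "abc"); // 40 hexadecimal characters
const sha512Hex = hexDigest("sha512", "abc"); // 128 hexadecimal characters
```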
$ dim generate -t [source name or file path] -m [message for post-process] -o [output file path]
$ dim generate -t kankou1 -m "convert to geojson"
Register multiple CKAN as repositories for cross-searching.
Prepare a file for repository management and register CKAN URL.
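The version is defined as a constant at the location below, but it is still v1.0.3 even though v1.0.4 is the latest release.
As a result, even after installing the latest binary from Releases, the message New version available: v1.0.4 is displayed.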
Line 2 in 45f09d8
The catalogUrl and catalogResourceId fields are replaced with null by dim update when the data was fetched by dim search -i.
// dim.json
{
"fileVersion": "1.1",
"contents": [
{
"url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
"name": "42_長崎県_長与町避難所_長与町避難所",
"catalogUrl": "https://www.geospatial.jp/ckan/dataset/42000-013",
"catalogResourceId": "f2d3ad73-83db-45e4-a11d-48bdd15fe60b",
"postProcesses": [],
"headers": {}
}
]
}
// dim-lock.json
{
"lockFileVersion": "1.1",
"contents": [
{
"name": "42_長崎県_長与町避難所_長与町避難所",
"url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
"path": "./data_files/42_長崎県_長与町避難所_長与町避難所/14nagayotownhinan.csv",
"catalogUrl": "https://www.geospatial.jp/ckan/dataset/42000-013",
"catalogResourceId": "f2d3ad73-83db-45e4-a11d-48bdd15fe60b",
"lastModified": "2018-08-27T14:32:21.000Z",
"eTag": "ff6b437fe66ac28b776a16a249f62b36",
"lastDownloaded": "2023-05-27T04:45:49.037Z",
"integrity": "d3db097cb5c1213821bb79730d5c895160302f6b",
"postProcesses": [],
"headers": {}
}
]
}
// dim.json
{
"fileVersion": "1.1",
"contents": [
{
"name": "42_長崎県_長与町避難所_長与町避難所",
"url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
"catalogUrl": null,
"catalogResourceId": null,
"postProcesses": [],
"headers": {}
}
]
}
// dim-lock.json
{
"lockFileVersion": "1.1",
"contents": [
{
"name": "42_長崎県_長与町避難所_長与町避難所",
"url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
"path": "./data_files/42_長崎県_長与町避難所_長与町避難所/14nagayotownhinan.csv",
"catalogUrl": null,
"catalogResourceId": null,
"lastModified": "2018-08-27T14:32:21.000Z",
"eTag": "ff6b437fe66ac28b776a16a249f62b36",
"lastDownloaded": "2023-05-27T04:46:51.049Z",
"integrity": "d3db097cb5c1213821bb79730d5c895160302f6b",
"postProcesses": [],
"headers": {}
}
]
}
We think it is possible to speed up the download (304 not modified) by using last_modified and e-tag in dim-lock.json.
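A sketch of how the conditional request headers might be built from a lock entry. The LockEntry type and function name are hypothetical, and it is an assumption that the eTag in dim-lock.json is stored without its surrounding quotes, as in the examples in this tracker.

```typescript
// Hypothetical subset of a dim-lock.json entry.
interface LockEntry {
  eTag: string | null;
  lastModified: string | null; // ISO 8601, as stored in dim-lock.json
}

// Builds If-None-Match / If-Modified-Since headers for a conditional GET.
function buildConditionalHeaders(entry: LockEntry): Record<string, string> {
  const headers: Record<string, string> = {};
  if (entry.eTag !== null) {
    // Assumption: the stored value lacks quotes; add them unless already present.
    const alreadyQuoted = entry.eTag.startsWith('"') || entry.eTag.startsWith("W/");
    headers["If-None-Match"] = alreadyQuoted ? entry.eTag : `"${entry.eTag}"`;
  }
  if (entry.lastModified !== null) {
    // HTTP dates use the format produced by toUTCString().
    headers["If-Modified-Since"] = new Date(entry.lastModified).toUTCString();
  }
  return headers;
}
```

If the server answers 304 Not Modified, the download and post-processing for that entry could be skipped and the existing file kept.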
Post-processing to run after all installations are complete.
dim install http://example.com/example.xlsx -p "xlsx-to-csv" -p "encode SJIS"
Currently, it is not possible to convert an xlsx file to a csv file and then to SJIS by executing the above command.
This is because if multiple postProcesses are specified, the path to the converted file is not passed to the next process.
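One way to address this is to have each post-process return the path of the file it produced and feed that path to the next step. The sketch below uses hypothetical, synchronous stand-ins for the real post-processes, which would likely be asynchronous.

```typescript
// Hypothetical shape: a post-process takes an input path and
// returns the path of the file it produced.
type PostProcess = (inputPath: string) => string;

// Runs the steps in order, passing each step's output path to the next.
function runPostProcesses(downloadedPath: string, steps: PostProcess[]): string {
  return steps.reduce((currentPath, step) => step(currentPath), downloadedPath);
}

// Stand-ins that only model the path change, not the file conversion.
const xlsxToCsv: PostProcess = (path) => path.replace(/\.xlsx?$/, ".csv");
const encodeSjis: PostProcess = (path) => path; // re-encodes the file in place

const finalPath = runPostProcesses("data_files/example/example.xlsx", [xlsxToCsv, encodeSjis]);
```

Here encodeSjis receives the .csv path produced by xlsxToCsv rather than the original .xlsx path.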
$ dim update [name]
If a name is specified
Redownload based on the contents of dim-lock.json
If no name is specified
Redownload based on the contents of dim.json
Passing a download file as standard input allows specifying post-processing to handle the contents of the file.
Output to standard output shall be saved as files under data_files if ">" is specified.
Hash the obtained data with sha-1 and save it to dim-lock.json
When running the installation interactively, the following input has been confirmed to be garbled
If only garbled characters are entered, no garbling occurs.
A test fails on the current code base when running on deno 1.30.x, with the following error message:
InstallAction ...
with URL ...
download and check that data_files, dim.json and dim-lock.json are saved. ... ok (20ms)
exit with error when name is not specified ... ok (5ms)
exit with error when run with "name" not recorded in dim.json ... ok (7ms)
overwrite existing files when specified name is duplicated and force is true ... ok (7ms)
download using request headers and check that they are recorded in dim.json and dim-lock.json when specify headers option ... ok (6ms)
encode downloaded file to Shift-JIS and record in dim.json, dim-lock.json when specify "encode sjis" as postProcesses ... ok (8ms)
exit with error when specify "encode utf-8 sjis" as postProcesses, and download ... ok (7ms)
exit with error when specify "encode" as postProcesses, and download. ... ok (6ms)
check that the command for darwin to extract the downloaded file is entered and recorded in dim.json and dim-lock.json. ... ok (6ms)
check that the decompress method is called with two arguments when the os is not darwin. ... ok (6ms)
exit with error when specify "unzip a" as postProcess and download ... ok (4ms)
convert downloaded file from xlsx to csv and record in dim.json and dim-lock.json when specify "xlsx-to-csv" as postProcesses ... ok (26ms)
convert downloaded file from xls to csv and record in dim.json and dim-lock.json when specify "xlsx-to-csv" as postProcesses ... ok (16ms)
exit with error when specify "xlsx-to-csv a" as postProcesses and download ... ok (14ms)
download file and execute echo command with downloaded file path as standard output when specify "cmd echo" as postProcesses ... ok (7ms)
download file and execute echo command with "a" and downloaded file path as standard output when specify "cmd echo a" as postProcesses ... ok (6ms)
exit with error when specify "cmd" as postProcesses and download ... ok (6ms)
output log and ignore error when specify error command such as "cmd aaa" as postProcesses ... ok (10ms)
exit with error when specify "aaa" as postProcess and download ... ok (5ms)
exit with error when if the URL is incorrectly described. ... FAILED (6ms)
error: AssertionError: spy not called with expected args:
[Diff] Actual / Expected
[
"\x1b[31mFailed to install.\x1b[39m",
- "\x1b[31mInvalid URL: 'aaa'\x1b[39m",
+ "\x1b[31mInvalid URL\x1b[39m",
]
throw new AssertionError(
^
at assertSpyCall (https://deno.land/[email protected]/testing/mock.ts:542:15)
at Object.<anonymous> (file:///home/osoken/Documents/works/projects/cfj/dim/repo/dim/tests/libs/actions.install.test.ts:798:9)
at async Function.runTest (https://deno.land/[email protected]/testing/_test_suite.ts:358:7)
at async Function.runTest (https://deno.land/[email protected]/testing/_test_suite.ts:346:9)
at async Function.runTest (https://deno.land/[email protected]/testing/_test_suite.ts:346:9)
at async fn (https://deno.land/[email protected]/testing/_test_suite.ts:316:13)
exit with error when failed to download ... ok (7ms)
exit with error when execute with URL and file path ... ok (6ms)
with URL ... FAILED (204ms)
The message displayed when a local file other than dim.json is specified for the -f option during installation is confusing.
local case
Not found a dim.json. You should run a 'dim init'
URL case
Selecting other than json.
dim install http://xxxx/data -n data -fn data.json
Create a search command
Create a search function
Search Results
$ dim search xxxxxx
package_title1
- package_url
- package_description
- package_license
1.resource_name1
* resource_url1
* resource_description1
* created1
* format
2.resource_name2
* resource_url2
* resource_description2
* created2
* format
package_title2
- package_url
- package_description
- package_license
3.resource_name3
* resource_url3
* resource_description3
* created3
* format
4.resource_name4
* resource_url4
* resource_description4
* created4
* format
"/" can be specified in the -n option.
This results in the creation of subdirectories within data_files.
If "/" is specified in the -n option, it is considered necessary to replace it with another character or take other measures.
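The replacement approach could be as small as the sketch below. The function name and the choice of "_" as the replacement character are assumptions. The backslash is also replaced, on the assumption that it acts as a path separator on Windows.

```typescript
// Hypothetical helper: replaces path separators in the -n value so that
// the name always maps to a single directory under data_files.
function sanitizeName(name: string, replacement = "_"): string {
  return name.replace(/[\\/]/g, replacement);
}
```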
Install all data that matches the regular expression
dim install https://opendata.go.jp/xxx/{{.*}}/xxxx.json
Specify HTTP headers in the same way as for the install command.
It seems that the "XLS to CSV" converter is only half working.
I tried
dim install https://www.city.chofu.tokyo.jp/www/contents/1489047638868/simple/1.xls -n "東京都調布市市立小・中学校一覧" -p xlsx-to-csv
and the result was that only 1.xls was downloaded, yet its contents had been converted to CSV.
I would like both 1.xls and 1.csv to be placed in the data_files folder.