googledatalab / pydatalab
Google Datalab Library
License: Apache License 2.0
The %chart magic fails when queries use Standard SQL, regardless of whether the referenced module specifies a dialect.
To validate:
When "myModule" is defined as:
%%sql
SELECT a, b FROM [table]
The following works:
%chart bars -d myModule -f a,b
However, if "myModule" is defined as:
%%sql
SELECT a,b FROM table
The same command (%chart bars -d myModule -f a,b) fails.
Either adding a dialect flag to %chart or respecting the dialect flag from the module would suffice to solve this.
How difficult would it be to add utilities for Google Cloud Datastore?
I noticed the following error in the Datalab console when using the %storage line magic command.
{"name":"app","hostname":"b86c9881eefb","pid":89,"type":"jupyter","level":30,"msg":"[9000]: ERROR:root:Cell magic `%%storage` not found (But line magic `%storage` exists, did you mean that instead?).\n","time":"2016-09-26T11:35:28.505Z","v":0}
The error appears in the console when the following code is executed, in an otherwise empty cell:
%storage read --object gs://cloud-datalab-samples/cars.csv --variable cars
print(cars)
Using a new cell, I can confirm that cars is not defined:
>print(cars)
NameErrorTraceback (most recent call last)
<ipython-input-16-b33add6e2bbd> in <module>()
----> 1 print cars
NameError: name 'cars' is not defined
The code works as expected if I put %storage on the second line (the first line can be empty, a comment, or Python code). For example,
# This comment is here to work around the issue where %storage doesn't work if it is the first line in a cell.
%storage read --object gs://cloud-datalab-samples/cars.csv --variable cars
print(cars)
The output is correctly shown as:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture Extended Edition","",4900.00
1999,Chevy,"Venture Extended Edition",Very Large,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
Our current plotly code in charting.ts only supports line, scatter and heatmap charts, and does a fair amount of munging to go from the format of the data we pass to charting to the format needed by plotly. At some point we should clean this up and simplify/expand the interface to plotly. We also need to check the static chart generation; it may have been affected by plotly/plotly.js#446
Currently, it's possible to create a DataSet from a file only. This assumes that the file contains valid data, which is usually not the case: almost all raw CSV files include some broken columns or fields; the classic Titanic CSV is one example. It is impossible to load the Titanic data into a DataSet with the following features description:
import google.cloud.ml.features as features
class TitanicFeatures(object):
"""This class is generated from command line:
%%mlalpha features
path: /content/datalab/ml/titanic/titanic.csv
headers: Id,Name,PClass,Age,Sex,Survived,SexCode
target: Survived
id: Id
format: csv
Please modify it as appropriate!!!
"""
csv_columns = ('Id','Name','PClass','Age','Sex','Survived','SexCode')
Survived = features.target('Survived').discrete()
Id = features.key('Id')
attrs = [
features.categorical('Name'),
features.numeric('Age'),
features.categorical('PClass'),
features.categorical('Sex'),
features.categorical('SexCode'),
]
Any attempt to load the data with the following code:
%%mlalpha dataset --name titanic_ds
source:
train: /content/datalab/ml/titanic/titanic.csv
featureset: TitanicFeatures
will result in a ValueError:
ValueError: could not convert string to float: Age
So IMHO it would be useful to be able to create a DataSet from a DataFrame, which I could use to run initial data cleaning prior to creating the DataSet.
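In the meantime, broken fields can be pre-cleaned before any DataSet is created. Below is a minimal stdlib sketch of that kind of cleaning step (clean_rows and its arguments are hypothetical names, not part of pydatalab); it coerces numeric columns such as Age and substitutes a default for unparseable values:

```python
import csv
import io

def clean_rows(csv_text, numeric_fields, default=0.0):
    """Coerce the named fields to float, substituting `default` on failure."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in numeric_fields:
            try:
                row[field] = float(row[field])
            except (TypeError, ValueError):
                row[field] = default
        cleaned.append(row)
    return cleaned

# An empty 'Age' value like the ones in the Titanic CSV:
raw = "Id,Name,Age\n1,Allen,29\n2,Allison,\n"
rows = clean_rows(raw, numeric_fields=['Age'])
```

The same coercion could equally be done on a pandas DataFrame; the point is that the cleaning happens before the featureset ever sees the data.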
There are probably other places we need to do this too.
If we do a %storage copy of a folder, we expand it to the objects to be copied and then copy them in turn. We should do these in parallel. We can do this pretty easily with our @async magic and wait_all().
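A sketch of the idea using the stdlib thread pool (copy_object here is a stand-in for the real per-object copy call, not the datalab API):

```python
from concurrent.futures import ThreadPoolExecutor

def copy_object(name):
    """Stand-in for the per-object copy call."""
    return 'copied ' + name

objects = ['gs://bucket/a', 'gs://bucket/b', 'gs://bucket/c']

# Fan the copies out instead of copying each object in turn.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(copy_object, objects))
```

The @async/wait_all() machinery mentioned above would play the same role as the pool here: submit all copies, then block until every one has finished.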
When you store an object on Cloud Storage with gzip transfer encoding, item.read_from fails with an exception. Try uploading a file to Cloud Storage with gsutil cp -z and without the -z option; the latter fetches OK while the first fails:
/usr/local/lib/python2.7/dist-packages/datalab/storage/_item.pyc in read_from(self, start_offset, byte_count)
180 start_offset=start_offset, byte_count=byte_count)
181 except Exception as e:
--> 182 raise e
183
184 def read_lines(self, max_lines=None):
Exception: Failed to send HTTP request.
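One possible fix is to detect the gzip magic bytes in the fetched payload and decompress transparently. This is only a sketch of the idea (maybe_decompress is a hypothetical helper), not the actual _item.py change:

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'

def maybe_decompress(payload):
    """Return plain bytes, decompressing when the payload is gzip-encoded."""
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload

# Simulates an object uploaded with `gsutil cp -z`:
compressed = gzip.compress(b'Year,Make,Model\n1997,Ford,E350\n')
plain = maybe_decompress(compressed)
```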
The following example from the BigQuery UDF SQL Reference doesn't work in Datalab. However, the example works on https://bigquery.cloud.google.com
Example:
%%sql -d standard
CREATE TEMPORARY FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x*y;
""";
WITH numbers AS
(SELECT 1 AS x, 5 as y
UNION ALL
SELECT 2 AS x, 10 as y
UNION ALL
SELECT 3 as x, 15 as y)
SELECT x, y, multiplyInputs(x, y) as product
FROM numbers;
After running the above code, I see the following error
%sql arguments: invalid syntax (<string>, line 1) from code 'CREATE TEMPORARY FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x*y;
""";'
invalid: Table name cannot be resolved: dataset name is missing
For example, running the following in a new notebook:
%bigquery schema --table cloud-datalab-samples:httplogs.logs_20140615
It generates no output.
This is to avoid getting these in the wrong order and to improve readability.
In pydatalab/datalab/context/_utils.py, get_project_id(), the gcloud command is not a valid command. At least it does not work with gcloud version 140
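For reference, a sketch of how get_project_id() could shell out to an invocation that does work in current gcloud releases (`gcloud config list --format "value(core.project)"`); the fallback-to-None behavior is my assumption, not the current implementation:

```python
import shutil
import subprocess

def get_project_id():
    """Return the active gcloud project, or None when it cannot be determined."""
    if shutil.which('gcloud') is None:
        return None
    try:
        out = subprocess.check_output(
            ['gcloud', 'config', 'list', '--format', 'value(core.project)'])
    except subprocess.CalledProcessError:
        return None
    return out.decode().strip() or None
```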
#40 added an option to specify whether to use the Standard SQL dialect, but this is done on a per-query basis. In notebooks with many queries, specifying dialect=standard each time is cumbersome and error-prone; it would be nice to have an option to change the default at the notebook level.
In the Importing and Exporting Data BigQuery sample notebook, an exception occurs when running table.to_file('/tmp/cars.csv')
> table.to_file('/tmp/cars.csv')
TypeErrorTraceback (most recent call last)
<ipython-input-16-90b5d4821467> in <module>()
----> 1 table.to_file('/tmp/cars.csv')
/usr/local/lib/python2.7/dist-packages/datalab/bigquery/_table.pyc in to_file(self, destination, format, csv_delimiter, csv_header)
648 for column in self.schema:
649 fieldnames.append(column.name)
--> 650 writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=csv_delimiter)
651 if csv_header:
652 writer.writeheader()
/usr/lib/python2.7/csv.pyc in __init__(self, f, fieldnames, restval, extrasaction, dialect, *args, **kwds)
135 extrasaction)
136 self.extrasaction = extrasaction
--> 137 self.writer = writer(f, dialect, *args, **kwds)
138
139 def writeheader(self):
TypeError: "delimiter" must be string, not unicode
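On Python 2 the csv module insists on native str arguments, so one likely fix in _table.py is to coerce the delimiter before constructing the writer. A sketch of that fix (the str() coercion is my assumption, not the shipped patch; on Python 3 it is a no-op):

```python
import csv
import io

def make_writer(f, fieldnames, csv_delimiter=u','):
    # Python 2's csv module rejects unicode delimiters, so pass the
    # native str type; on Python 3 str() leaves the value unchanged.
    return csv.DictWriter(f, fieldnames=fieldnames, delimiter=str(csv_delimiter))

buf = io.StringIO()
writer = make_writer(buf, ['Make', 'Model'])
writer.writeheader()
writer.writerow({'Make': 'Ford', 'Model': 'E350'})
```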
Cloud monitoring supports distribution return types (eg, for a histogram)
https://cloud.google.com/monitoring/api/ref_v3/rest/v3/TypedValue#distribution
Please provide a sample in the README describing how to render a histogram distribution.
Using the following sample code from the Asynchronous Methods and Jobs wiki page:
job = double(10)
while !job.is_complete():
# Do something else
print "waiting..."
if job.failed():
print "Failed! %s" % (', '.join(job.errors))
else:
print job.result()
I receive the following exception:
File "<ipython-input-35-xxxxxxxx>", line 2
while !job.is_complete():
^
SyntaxError: invalid syntax
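The wiki sample uses C-style syntax that Python rejects: `!` is not a negation operator (use `not`), and on Python 3 print is a function. A corrected version of the loop follows; FakeJob is a stub standing in for the real job object, so the snippet is self-contained:

```python
import time

class FakeJob(object):
    """Stub with the same surface as a datalab async job."""
    def __init__(self, value):
        self._value, self._polls = value, 0
    def is_complete(self):
        self._polls += 1
        return self._polls >= 3
    def failed(self):
        return False
    def result(self):
        return self._value

job = FakeJob(20)
while not job.is_complete():      # 'not', never '!'
    # Do something else
    time.sleep(0.01)
if job.failed():
    outcome = 'Failed! %s' % ', '.join(job.errors)
else:
    outcome = job.result()
```

The wiki page itself should probably be updated with the same corrections.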
This was meant for when we had deployable query pipelines, but the original plans fell through. We should either remove the command or figure out its purpose going forward.
There may be a use case for users to upload content types other than 'text/plain' when using the %%storage write magic command.
BigQuery recently added support for Standard SQL (beta). It would be helpful to have the option to choose between Legacy SQL and Standard SQL.
Links:
https://cloud.google.com/blog/big-data/2016/06/bigquery-111-now-with-standard-sql-iam-and-partitioned-tables
https://cloud.google.com/bigquery/sql-reference/
https://cloud.google.com/bigquery/sql-reference/enabling-standard-sql
%%sql --help
results in an exception. This exists in the datalab package used in gcr.io/cloud-datalab/datalab:local
>%%sql --help
TypeErrorTraceback (most recent call last)
<ipython-input-1-88722d61f288> in <module>()
----> 1 get_ipython().run_cell_magic(u'sql', u'-h', u'')
/usr/local/lib/python2.7/dist-packages/datalab/kernel/__init__.pyc in _run_cell_magic(self, magic_name, line, cell)
89 fn = self.find_line_magic(magic_name)
90 if fn:
---> 91 return _orig_run_line_magic(self, magic_name, line)
92 # IPython will complain if cell is empty string but not if it is None
93 cell = None
/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2065 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2066 with self.builtin_trap:
-> 2067 result = fn(*args,**kwargs)
2068 return result
2069
TypeError: sql() takes exactly 2 arguments (1 given)
There may be a use case for users to have the ability to override the project-wide default BigQuery billing tier on a per-query basis. This can be done by adding configuration.query.maximumBillingTier
as part of the query request.
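The field would sit in the jobs.insert request body; a hypothetical example of where it goes (the pydatalab plumbing for exposing it is not shown):

```python
# Sketch of a BigQuery jobs.insert request body carrying the override;
# configuration.query.maximumBillingTier is the documented REST field.
query_request = {
    'configuration': {
        'query': {
            'query': 'SELECT COUNT(*) FROM [dataset.table]',
            'maximumBillingTier': 2,
        }
    }
}
```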
An exception occurs when running the following code:
%%sql --module MyModule
DEFINE QUERY q1
WITH MyTempTable AS
(
SELECT 'Datalab' AS Name
)
SELECT * FROM MyTempTable
SELECT * FROM $q1
import datalab.bigquery as bq
bq.Query(MyModule.q1).results(dialect='standard')
Exception:
ExceptionTraceback (most recent call last)
<ipython-input-5-e39df25b5c74> in <module>()
1 import datalab.bigquery as bq
----> 2 bq.Query(MyModule.q1).results(dialect='standard')
/usr/local/lib/python2.7/dist-packages/datalab/bigquery/_query.pyc in results(self, use_cache, dialect, billing_tier)
226 """
227 if not use_cache or (self._results is None):
--> 228 self.execute(use_cache=use_cache, dialect=dialect, billing_tier=billing_tier)
229 return self._results.results
230
/usr/local/lib/python2.7/dist-packages/datalab/bigquery/_query.pyc in execute(self, table_name, table_mode, use_cache, priority, allow_large_results, dialect, billing_tier)
524 job = self.execute_async(table_name=table_name, table_mode=table_mode, use_cache=use_cache,
525 priority=priority, allow_large_results=allow_large_results,
--> 526 dialect=dialect, billing_tier=billing_tier)
527 self._results = job.wait()
528 return self._results
/usr/local/lib/python2.7/dist-packages/datalab/bigquery/_query.pyc in execute_async(self, table_name, table_mode, use_cache, priority, allow_large_results, dialect, billing_tier)
490 except KeyError:
491 # The query was in error
--> 492 raise Exception(_utils.format_query_errors(query_result['status']['errors']))
493 return _query_job.QueryJob(job_id, table_name, self._sql, context=self._context)
494
Exception: invalidQuery: Syntax error: Unexpected end of statement at [1:1]
I don't see an exception when running the code below:
%%sql --module MyModule
DEFINE QUERY q1
SELECT 'Datalab' AS Name
SELECT * FROM $q1
import datalab.bigquery as bq
bq.Query(MyModule.q1).results(dialect='standard')
The issue seems to occur only when the WITH clause is used in the same cell as DEFINE QUERY. Thanks to @HaipengSu for discovering this!
I'm working on a PR to correct this issue.
Basically we'll need something like https://github.com/googledatalab/datalab/tree/datalab-managed/docs.
I've defined a UDF in Datalab like this:
%%bigquery udf --module foo
...
And I was able to use it on a BQ query:
%%sql --module foo
...
But when I try to reference these from a chart, I get an "Unknown TVF" error.
%%chart scatter --data foo
...
An issue was reported on StackOverflow related to the dialect parameter not being honored when using %%sql --module. This feels like a bug.
The -d standard flag, which sets the BigQuery dialect to Standard SQL, has no effect in the following example:
%%sql --module data_name -d standard
Currently users still need to specify the dialect parameter when calling bq.Query.to_dataframe(), which could be confusing.
bq.Query(data_name).to_dataframe(dialect='standard')
The following works as expected (i.e. when --module is not used in %%sql):
%%sql -d standard <sql query>
I see the following error message if I put a comment above my SQL query when using %%sql
>%sql arguments: name 'comment' is not defined from code '--comment
Using the following code:
%%sql
--comment
SELECT * FROM MyDataSet.MyTable
If I move the comment after my SQL query, I won't see the error:
%%sql
SELECT * FROM MyDataSet.MyTable
--comment
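One plausible fix is for the magic to skip leading `--` comment lines before treating the first line of the cell as arguments. A sketch of that idea (strip_leading_comments is hypothetical, not the real %%sql parser):

```python
def strip_leading_comments(cell):
    """Drop leading '--' comment lines so they are not parsed as arguments."""
    lines = cell.splitlines()
    start = 0
    while start < len(lines) and lines[start].lstrip().startswith('--'):
        start += 1
    return '\n'.join(lines[start:])

cell = "--comment\nSELECT * FROM MyDataSet.MyTable"
sql = strip_leading_comments(cell)
```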
For example, %%bigquery sample --help:
usage: bigquery sample [-h] [-q QUERY | -t TABLE | -v VIEW]
[-d {legacy,standard}] [-b BILLING] [-c COUNT]
[-m {limit,random,hashed,sorted}] [-p PERCENT]
[-f FIELD] [-o {ascending,descending}] [-P] [--verbose]
But it does not display the help message:
"
Display a sample of the results of a BigQuery SQL query. The cell can
optionally contain arguments for expanding variables in the query, if
-q/--query was used, or it can contain SQL for a query.
"
While setting up my development environment, I noticed that the installation instructions in file https://github.com/googledatalab/pydatalab/blob/master/README.md should direct users to https://github.com/googledatalab/pydatalab.git instead of https://github.com/googledatalab/datalab.git.
Running pydatalab within Jupyter gives an import error for the ML module. Within Datalab, it works fine.
ImportError: No module named ml
I guess the required libraries need to be included in setup.py?
%projects set my-new-project
The Datalab project is updated in the UI (via the sign-in button). But:
!gcloud config list project
Your active configuration is: [default]
[core]
project = my-old-project
Restarting Datalab doesn't help either.
Hello,
We've been using the pydatalab modules to interact with BigQuery in an application. When running from a local machine with my personal account this works like a charm; however, when using a service account from within a Docker container all commands seem to fail with:
RequestException: HTTP request failed: Invalid Credentials
However, when I use gcloud.bigquery I don't seem to have this problem.
In more detail, I'm creating a Docker image and including the head of pydatalab in it. I'm then building this with "pip install -e .". My code is then of the form:
import datalab.bigquery as bq
table_name = "my_project:my_dataset.my_table"
bq.Table(table_name).exists()
When run I get the following:
File "main.py", line 75, in main
test = bq.Table(table_name).exists()
File "/pydatalab/datalab/bigquery/_table.py", line 189, in exists
raise e
RequestException: HTTP request failed: Invalid Credentials
I'm exporting my service account credentials to docker via a command of the form:
docker run \
  -v my_creds.json:/etc/key_file.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/etc/key_file.json \
  my_docker_image
As far as I can tell the credentials are correct and make it into the image; also, when following the above methodology, both gcloud.bigquery and gcloud.pubsub work.
Thanks
Matthew
I noticed an issue where context.credentials.get_access_token()
does not return the most recent oauth2client access token.
Steps to reproduce: revoke access for the Google Cloud SDK in your Google Account Security Settings, then request an access token; the request fails with invalid_grant: Token has been revoked.
>from datalab.context import Context
>context = Context.default().credentials.get_access_token()
HttpAccessTokenRefreshErrorTraceback (most recent call last)
<ipython-input-16-db0a2becc447> in <module>()
1 from datalab.context import Context
2
----> 3 context = Context.default().credentials.get_access_token()
/usr/local/lib/python2.7/dist-packages/oauth2client/client.pyc in get_access_token(self, http)
773 if not http:
774 http = httplib2.Http()
--> 775 self.refresh(http)
776 return AccessTokenInfo(access_token=self.access_token,
777 expires_in=self._expires_in())
/usr/local/lib/python2.7/dist-packages/oauth2client/client.pyc in refresh(self, http)
656 request.
657 """
--> 658 self._refresh(http.request)
659
660 def revoke(self, http):
/usr/local/lib/python2.7/dist-packages/oauth2client/client.pyc in _refresh(self, http_request)
861 """
862 if not self.store:
--> 863 self._do_refresh_request(http_request)
864 else:
865 self.store.acquire_lock()
/usr/local/lib/python2.7/dist-packages/oauth2client/client.pyc in _do_refresh_request(self, http_request)
930 except (TypeError, ValueError):
931 pass
--> 932 raise HttpAccessTokenRefreshError(error_msg, status=resp.status)
933
934 def _revoke(self, http_request):
HttpAccessTokenRefreshError: invalid_grant: Token has been revoked.
I've confirmed that oauth2client returns the correct token, which is still valid. Here is the test code:
import oauth2client
auth_header = "Authorization:\ Bearer\ " + oauth2client.client.GoogleCredentials.get_application_default().get_access_token()[0]
!curl -H "Metadata-Flavor: Google" -H $auth_header https://www.googleapis.com/oauth2/v1/userinfo
It seems unlikely that users will hit this specific issue.
An additional possibility is that the access token expires. The user signs in successfully, but datalab is still using the expired access token.
This second scenario, access token expiration, may not be an issue based on the following link that says access tokens will be refreshed automatically by oauth2client.
Update your local interpreter to Python 3.5, or use Python 3.5 on Travis, and these tests fail with the following error:
Traceback (most recent call last):
File "/home/travis/virtualenv/python3.5.2/lib/python3.5/site-packages/mock/mock.py", line 1305, in patched
return func(*args, **keywargs)
File "/home/travis/build/yebrahim/pydatalab/tests/stackdriver/monitoring/query_metadata_tests.py", line 70, in test_as_dataframe
dataframe = query_metadata.as_dataframe()
File "/home/travis/virtualenv/python3.5.2/lib/python3.5/site-packages/datalab/stackdriver/monitoring/_query_metadata.py", line 68, in as_dataframe
for ts in self._timeseries_list[:max_rows]]
File "/home/travis/virtualenv/python3.5.2/lib/python3.5/site-packages/datalab/stackdriver/monitoring/_query_metadata.py", line 68, in <listcomp>
for ts in self._timeseries_list[:max_rows]]
AttributeError: 'Resource' object has no attribute '__dict__'
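The listcomp in _query_metadata.py reads each timeseries resource's __dict__, but on Python 3.5 the client library's Resource behaves like a namedtuple, which has no instance __dict__. A sketch of a tolerant conversion (as_dict is a hypothetical helper, not the actual patch):

```python
from collections import namedtuple

def as_dict(obj):
    """Convert an object to a dict, tolerating namedtuples without __dict__."""
    try:
        return vars(obj)
    except TypeError:
        # namedtuple instances expose their fields via _asdict() instead.
        return dict(obj._asdict())

Resource = namedtuple('Resource', ['resource_type', 'labels'])
row = as_dict(Resource('gce_instance', {'zone': 'us-central1-a'}))
```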
Currently, "cloud-ml-test-automated" is used in the generated code cells, for example in the Iris/2.Preprocess notebook.
Not sure if it is a Google Maps issue or a Datalab issue, but today we don't know how to display more than 400 markers (lat/long). Three days ago we opened a Google Maps ticket:
https://code.google.com/p/gmaps-api-issues/issues/detail?id=9801&can=4&colspec=ID%20Type%20Status%20Introduced%20Fixed%20Summary%20Stars%20ApiType%20Internal
While exporting an IPython notebook to HTML, the Google Charts are missing from the generated HTML. We have noticed this with treemap and sankey diagrams. Also, the HTML page shows console errors like:
Failed to load resource: the server responded with a status of 404 (Not Found)
http://127.0.0.1:8082/nbconvert/html/extensions/charting.js
Uncaught Error: Script error for: extensions/charting
http://requirejs.org/docs/errors.html#scripterror
The charts export fine when generated from other libraries like matplotlib/seaborn, plotly and graphviz.
I noticed the following error when trying to build pydatalab on Ubuntu 16.10.
tony@tonypc:~/pydatalab-parthea$ ./install-no-virtualenv.sh
datalab/notebook/static/charting.ts(810,11): error TS2399: Duplicate identifier '_this'. Compiler uses variable declaration '_this' to capture 'this' reference.
datalab/notebook/static/charting.ts(812,9): error TS2400: Expression resolves to variable declaration '_this' that compiler uses to capture 'this' reference.
This error is also causing the Datalab build script $REPO_DIR/containers/datalab/build.sh
to fail.
Step 14 : RUN ipython profile create default && jupyter notebook --generate-config && if [ -d /datalab/lib/pydatalab/.git ]; then echo "use local lib"; else git clone https://github.com/googledatalab/pydatalab.git /datalab/lib/pydatalab; fi && cd /datalab/lib/pydatalab && /tools/node/bin/npm install -g typescript && tsc --module amd --noImplicitAny --outdir datalab/notebook/static datalab/notebook/static/*.ts && /tools/node/bin/npm uninstall -g typescript && pip install --no-cache-dir . && jupyter nbextension install --py datalab.notebook && jupyter nbextension enable --py widgetsnbextension && rm datalab/notebook/static/*.js && mkdir -p /datalab/nbconvert && cp -R /usr/local/share/jupyter/nbextensions/gcpdatalab/* /datalab/nbconvert && ln -s /usr/local/lib/python2.7/dist-packages/notebook/static/custom/custom.css /datalab/nbconvert/custom.css && mkdir -p /usr/local/lib/python2.7/dist-packages/notebook/static/components/codemirror/mode/text/sql/text && ln -s /usr/local/share/jupyter/nbextensions/gcpdatalab/codemirror/mode/sql.js /usr/local/lib/python2.7/dist-packages/notebook/static/components/codemirror/mode/text/sql/text/sql.js && cd /
---> Running in 6ade2d2451af
[ProfileCreate] Generating default config file: u'/root/.ipython/profile_default/ipython_config.py'
[ProfileCreate] Generating default config file: u'/root/.ipython/profile_default/ipython_kernel_config.py'
Writing default config to: /root/.jupyter/jupyter_notebook_config.py
use local lib
/tools/node/bin/tsc -> /tools/node/lib/node_modules/typescript/bin/tsc
/tools/node/bin/tsserver -> /tools/node/lib/node_modules/typescript/bin/tsserver
[email protected] /tools/node/lib/node_modules/typescript
datalab/notebook/static/charting.ts(810,11): error TS2399: Duplicate identifier '_this'. Compiler uses variable declaration '_this' to capture 'this' reference.
datalab/notebook/static/charting.ts(812,9): error TS2400: Expression resolves to variable declaration '_this' that compiler uses to capture 'this' reference.
The command '/bin/sh -c ipython profile create default && jupyter notebook --generate-config && if [ -d /datalab/lib/pydatalab/.git ]; then echo "use local lib"; else git clone https://github.com/googledatalab/pydatalab.git /datalab/lib/pydatalab; fi && cd /datalab/lib/pydatalab && /tools/node/bin/npm install -g typescript && tsc --module amd --noImplicitAny --outdir datalab/notebook/static datalab/notebook/static/*.ts && /tools/node/bin/npm uninstall -g typescript && pip install --no-cache-dir . && jupyter nbextension install --py datalab.notebook && jupyter nbextension enable --py widgetsnbextension && rm datalab/notebook/static/*.js && mkdir -p /datalab/nbconvert && cp -R /usr/local/share/jupyter/nbextensions/gcpdatalab/* /datalab/nbconvert && ln -s /usr/local/lib/python2.7/dist-packages/notebook/static/custom/custom.css /datalab/nbconvert/custom.css && mkdir -p /usr/local/lib/python2.7/dist-packages/notebook/static/components/codemirror/mode/text/sql/text && ln -s /usr/local/share/jupyter/nbextensions/gcpdatalab/codemirror/mode/sql.js /usr/local/lib/python2.7/dist-packages/notebook/static/components/codemirror/mode/text/sql/text/sql.js && cd /' returned a non-zero code: 2
I'm going to submit a PR for this.
Here are the code and the output in a Jupyter notebook cell. I expected it to bring in dependencies; instead I got this error. Also, shouldn't dateutil 2.5.3 satisfy the requirement?
%%bash
pip install -i https://testpypi.python.org/pypi datalab
Collecting datalab
Using cached https://testpypi.python.org/packages/74/54/2654870a38ed720ee9afc3e5924721bff85cb10f25564f288f036c1d196c/datalab-0.1.1606020925-py2-none-any.whl
Collecting python-dateutil==2.5.0 (from datalab)
Could not find a version that satisfies the requirement python-dateutil==2.5.0 (from datalab) (from versions: )
No matching distribution found for python-dateutil==2.5.0 (from datalab)