paypal / gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Home Page: http://gimel.io
License: Apache License 2.0
Currently we have the document docs/getting-started.md.
We will move it to the folder docs/getting-started, breaking all the documentation in that one markdown file down into several files.
There are a few steps to correct in the Try-Gimel documentation: the GIMEL_HOME shell variable, the "cd .." step (always have the user refer to $GIMEL), and the bintray reference.
While trying to amend commit messages, there was an accidental force push to master that caused several commits to be lost. This issue is to restore the lost commits.
Currently we have integration enabled with http://gimel.readthedocs.io/.
This gives everyone search capability.
But the landing page itself is not rendering, as we need to bootstrap readthedocs with a simple file. See:
https://docs.readthedocs.io/en/latest/getting_started.html
The code that captures and logs method names is scattered across many places (40+ files):
def MethodName: String = new Exception().getStackTrace.apply(1).getMethodName
logger.info(" @Begin --> " + MethodName)
It is also cryptic, uses poor coding practices, and violates Scala naming conventions. This code should be refactored into one place, ideally in gimel.logging.Logger.
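One possible shape for that refactor, kept in a single helper (a sketch; the object name and its placement are assumptions, not the actual gimel.logging API):

```scala
// Sketch: capture the calling method's name in one place instead of 40+ copies.
// Object name and placement are illustrative, not the actual gimel.logging API.
object MethodNameUtil {
  // Frame 0 = Thread.getStackTrace, frame 1 = this method, frame 2 = the caller.
  def callerMethodName: String =
    Thread.currentThread().getStackTrace()(2).getMethodName
}

// Usage at the top of an instrumented method:
//   logger.info(" @Begin --> " + MethodNameUtil.callerMethodName)
```

Each call site then shrinks to one line and the stack-walking trick lives behind a single, documented method.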
The Spark open source world has a Tableau Data Extract connector. We can add a similar connector on Gimel to support creating a Tableau Data Extract directly from a Spark dataframe.
https://github.com/werneckpaiva/spark-to-tableau/blob/master/README.md
Push the Gimel Open Source Code to the Repo
The test cases in modules are not running because the scalatest maven plugin is missing from the module pom files.
To reproduce:
$ mvn test
Expected behavior:
The DataSetUtilsSpec.scala test suite should run and all test cases in it should pass.
Actual behavior:
DataSetUtilsSpec.scala does NOT run.
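A likely fix is to add the ScalaTest Maven plugin to each module pom (a sketch; the version and execution id are illustrative):

```xml
<!-- Sketch: enable ScalaTest suites (e.g. DataSetUtilsSpec) under "mvn test".
     Version and execution id are illustrative. -->
<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <version>2.0.0</version>
  <executions>
    <execution>
      <id>scalatest</id>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```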
Please add a Jira connector on Gimel to support extracting jira issues directly from Jira to process as a dataframe.
Problem Statement
The current implementation of the Gimel-Core Data API loads all connectors, because the dependencies are resolved at compile time.
With this pattern, a use case that requires, say, just the Kafka and Elasticsearch connectors also loads every other connector Gimel supports.
The implication is
Solution
With UnPayPal & open sourcing, we need to start pushing Gimel artifacts to Maven Central.
Add the necessary hooks/implementations to enable building and pushing artifacts to Maven Central.
Add the mapping for Spark UI port 4040 in docker-compose, so that users can view the queries they are running in spark-shell on the Spark UI at http://localhost:4040
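The docker-compose change could look like this (a sketch; the service name spark-master and the file layout are assumptions about the quickstart compose file):

```yaml
# Sketch of the docker-compose mapping; service name is illustrative.
services:
  spark-master:
    ports:
      - "4040:4040"   # Spark UI of the driver, reachable at http://localhost:4040
```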
The Spark open source world has an SFTP connector. We can add a similar connector on Gimel to support SFTP or Dropzone in/out data consumption.
https://github.com/springml/spark-sftp/blob/master/README.md
Thanks
Prabhu
Replace gimel.kafka.bootstrap.servers with bootstrap.servers, as per the Kafka API.
Please add a Twitch Streaming connector on Gimel to support extracting twitch feeds directly from Twitch Streamer to process as a dataframe.
https://github.com/agapic/Twitch-Streamer/blob/master/README.md
The current project is missing a .rat-excludes file listing the set of files that do not need Apache License 2 headers. Add the .rat-excludes list (e.g. pom.xml files, shell scripts, .properties files). Files that don't require Apache 2 license headers are listed at http://www.apache.org/legal/src-headers.html#faq-exceptions
Currently, CheckStyle is set to warning level for several modules. This is because we have to do a lot of code cleanup to adhere to the rules specified in the checkstyle specification.
Without adding or changing any logic, address only the checkstyle errors.
A plugin can be used to auto-fix the checkstyle violations, but this requires post-fix validation to ensure no inadvertent logic changes are performed by the plugin.
The other way is to fix each module manually.
Ensure the current code base adheres to checkstyle by running the auto-fixer on the entire code base.
Please add a Github connector on Gimel to support extracting Github feeds (issues, commits, changes, etc.) directly from Github to process as a dataframe.
https://github.com/lightcopy/spark-github-pr/blob/master/README.md
Add redshift connector to Gimel
Update user and developer forum links.
Need to add a code of conduct doc for community guidelines.
Please add a Twitter connector on Gimel to support streaming Twitter tweets using the Twitter API to process as a dataframe.
It is very difficult to unit-test Gimel, because a lot of dependencies, configurations, and options are hard-coded in Gimel source code and it is almost impossible to change them or mock them. Gimel should be refactored so we can unit-test Gimel.
Add Test Cases in the most sensitive segments of the Kafka Module.
We are currently using a simple regex-based parser to parse GSQL queries. We have to implement the Spark SQL parser to parse GSQL queries.
https://github.com/jaceklaskowski/sql-parser/blob/master/SparkParser.scala
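For context, the current regex style of parsing can be pictured like this (an illustrative sketch, not the actual Gimel parser; pattern and object name are assumptions). A real SQL parser such as Spark's would also handle quoting, nesting, and comments:

```scala
// Illustrative sketch of a regex "parser": extract pcatalog.* dataset names
// from a GSQL string. This is the style of parsing the Spark SQL parser
// would replace; the pattern and object name are assumptions.
object GsqlRegexParser {
  private val DataSetPattern = """pcatalog\.\w+""".r

  // Return dataset names in order of first appearance, without duplicates.
  def dataSets(sql: String): Seq[String] =
    DataSetPattern.findAllIn(sql).toList.distinct
}
```

A regex like this breaks down on subqueries, quoted identifiers, and comments, which is exactly why a full SQL parser is preferable.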
Update the PR template with:
Integrate gimel with Codacy
Remove unused code to improve code quality.
Refer here for complete report.
https://app.codacy.com/app/Dee-Pac/gimel/dashboard?bid=6990553
With UnPayPal & open sourcing, the CI tied to PayPal Jenkins has been unpinned.
Add a new CI to continue builds for master & pull requests.
One option is Travis CI.
Currently, Gimel properties are built by loading the resources/properties file.
However, we need to provide the capability for a user to load a file that is external to the JVM.
It is not a scalable approach to recompile the code each time a property changes!
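A minimal sketch of external-first property loading, assuming the bundled resource is named pcatalog.properties (the object and method names here are illustrative, not the actual Gimel API):

```scala
import java.io.{File, FileInputStream}
import java.util.Properties

// Sketch: prefer a properties file outside the JVM/JAR, falling back to the
// bundled classpath resource. Names are illustrative, not the Gimel API.
object GimelPropertiesLoader {
  def load(externalPath: Option[String],
           resource: String = "pcatalog.properties"): Properties = {
    val props = new Properties()
    externalPath.map(new File(_)).filter(_.exists()) match {
      case Some(f) =>
        val in = new FileInputStream(f)    // file outside the JAR wins
        try props.load(in) finally in.close()
      case None =>
        // fall back to the properties bundled on the classpath, if present
        Option(getClass.getResourceAsStream("/" + resource)).foreach { in =>
          try props.load(in) finally in.close()
        }
    }
    props
  }
}
```

With this shape, a property change only requires editing the external file, not a recompile.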
The Spark open source world has a Workday connector. We can add a similar connector on Gimel to support Workday in/out data consumption.
Build Scripts are Missing the Apache License.
Description
The default gimel properties file (pcatalog.properties) is not loaded properly.
Fix
There is an issue in the path assigned to the properties file.
As a developer, I would like GH Issues that are fixed to automatically close when the referencing PR is merged.
This can be done by adding a set of reserved words to the body of either the PR or the commit message as described here.
I propose updating the PR template to include this so that we ensure all PRs are compliant with this practice.
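For illustration, a PR description using one of GitHub's closing keywords (close, fix, resolve, and their variants); the issue number below is hypothetical:

```markdown
Adds the missing .rat-excludes file.

Fixes #123
```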
Add Dynamo DB connector to Gimel
Currently, GitHub Pages renders via an index.md that I created by making a copy of README.md.
However, we can enhance this with an index.html & CSS style sheet combination. This will render a static page providing a better user experience.
Currently we are suggesting everyone use build/gimel to build Gimel.
However, this has to be enhanced for the reasons below: it installs the entire project.
Following are some fixes needed in the quickstart scripts:
safemode: Call From e4e0ae5d275a/172.18.0.4 to namenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
20180411 11:43:02 | Error | Stage | docker exec -it namenode hadoop dfsadmin -safemode leave
20180411 11:43:02 | Error | Stage | sh /Users/pkasinathan/workspace/gimel-oss/gimel-dataapi/gimel-quickstart/bootstrap.sh kafka,elasticsearch
20180411 16:39:39 | -----------------------------------------------------------------
20180411 16:39:39 | Deleting topic if exists...
20180411 16:39:39 | -----------------------------------------------------------------
20180411 16:39:39 | Executing Command --> docker exec -it kafka kafka-topics --delete --zookeeper zookeeper:2181 --topic gimel.demo.flights
Error while executing topic command : Topic gimel.demo.flights does not exist on ZK path zookeeper:2181
[2018-04-11 23:39:43,609] ERROR java.lang.IllegalArgumentException: Topic gimel.demo.flights does not exist on ZK path zookeeper:2181
at kafka.admin.TopicCommand$.deleteTopic(TopicCommand.scala:180)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:71)
at kafka.admin.TopicCommand.main(TopicCommand.scala)
(kafka.admin.TopicCommand$)
20180411 16:39:42 | Error | Stage | docker exec -it kafka kafka-topics --delete --zookeeper zookeeper:2181 --topic gimel.demo.flights
20180411 16:39:47 | ------------------------------------------------------------------------------
20180411 16:40:28 | ------------------------------------------------------------------------------
20180411 16:40:28 | ALL STORAGE CONTAINERS - LAUNCHED
20180411 16:40:28 | ------------------------------------------------------------------------------
20180411 16:40:28 |
20180411 16:40:28 |
20180411 16:40:28 | ------------------------------------------------------------------------------
20180411 16:40:28 | Setting up spark container...
20180411 16:40:28 | ------------------------------------------------------------------------------
20180411 16:40:28 | Executing Command --> docker cp /Users/pkasinathan/workspace/gimel-oss/gimel-dataapi/gimel-standalone/lib/gimel-sql-1.2.0-SNAPSHOT-uber.jar spark-master:/root/
20180411 16:40:33 | Success | Stage | docker cp /Users/pkasinathan/workspace/gimel-oss/gimel-dataapi/gimel-standalone/lib/gimel-sql-1.2.0-SNAPSHOT-uber.jar spark-master:/root/
20180411 16:40:33 | Executing Command --> docker cp hive-server:/opt/hive/conf/hive-site.xml /Users/pkasinathan/workspace/gimel-oss/tmp/hive-site.xml
20180411 16:40:33 | Success | Stage | docker cp hive-server:/opt/hive/conf/hive-site.xml /Users/pkasinathan/workspace/gimel-oss/tmp/hive-site.xml
20180411 16:40:33 | Executing Command --> docker cp /Users/pkasinathan/workspace/gimel-oss/tmp/hive-site.xml spark-master:/spark/conf/
20180411 16:40:34 | Success | Stage | docker cp /Users/pkasinathan/workspace/gimel-oss/tmp/hive-site.xml spark-master:/spark/conf/
20180411 16:40:34 | Executing Command --> docker cp hbase-master:/opt/hbase-1.2.6/conf/hbase-site.xml /Users/pkasinathan/workspace/gimel-oss/tmp/hbase-site.xml
Error: No such container:path: hbase-master:/opt/hbase-1.2.6/conf/hbase-site.xml
20180411 16:40:34 | Error | Stage | docker cp hbase-master:/opt/hbase-1.2.6/conf/hbase-site.xml /Users/pkasinathan/workspace/gimel-oss/tmp/hbase-site.xml
20180411 16:40:34 | Error | Stage | sh /Users/pkasinathan/workspace/gimel-oss/gimel-dataapi/gimel-quickstart/bootstrap.sh kafka,elasticsearch
pkasinathan@LM-SJN-21013346:~/workspace/gimel-oss $
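The failing stages above could be made more resilient with a small retry helper in the bootstrap script (a sketch; attempt counts and the docker commands in the comments are illustrative, and kafka-topics' --if-exists option can make the topic deletion idempotent):

```shell
# Sketch: re-run a bootstrap stage a few times before declaring it failed,
# e.g. to give the namenode time to leave safemode. Limits are illustrative.
retry_stage() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0      # stage succeeded
    i=$((i + 1))
    sleep 1               # brief back-off before the next attempt
  done
  return 1                # stage failed after all attempts
}

# Example usage in bootstrap.sh (not run here):
#   retry_stage 30 docker exec namenode hdfs dfsadmin -safemode get
#   docker exec kafka kafka-topics --delete --if-exists \
#     --zookeeper zookeeper:2181 --topic gimel.demo.flights
```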
scala> import com.paypal.gimel._
import com.paypal.gimel._
scala> val dataSet = DataSet(spark)
dataSet: com.paypal.gimel.DataSet = com.paypal.gimel.DataSet@4c49471c
scala> val df = dataSet.read("pcatalog.flights_hdfs")
Catalog Provider is --> USER
Resolving Catalog Via catalogProvider --> USER
java.util.NoSuchElementException: key not found: pcatalog.flights_hdfs.dataSetProperties
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at com.paypal.gimel.common.catalog.CatalogProvider$.getDataSetProperties(CatalogProvider.scala:72)
at com.paypal.gimel.DataSet.read(DataSet.scala:89)
... 52 elided
scala> df.count
res10: Long = 1398164
scala>
Currently, in the bootstrap SQL scripts in the gimel-dataapi/gimel-standalone/sql folder, the license header is commented with "#". It should be replaced with "--", as Hive SQL files accept "--" as a comment.
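For example, a header in the corrected comment style (the header text is shortened and the statement below is illustrative, not taken from the actual scripts):

```sql
-- Licensed under the Apache License, Version 2.0 (header shortened for
-- illustration). Hive SQL treats lines beginning with "--" as comments,
-- whereas "#" is not a valid comment prefix in these scripts.
CREATE DATABASE IF NOT EXISTS pcatalog;
```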
The Gimel Standalone feature will provide capability for developers / users alike to