
Pipeline Status

Google Group: Maha-Users

Maha Release Notes

Maha Release Pipeline

Maha

A centralised library for building reporting APIs on top of multiple data stores to exploit them for what they do best.

We run millions of queries on multiple data sources for analytics every day. They run on hive, oracle, druid etc. We needed a way to utilize the data stores in our architecture to exploit them for what they do best. This meant we needed to easily tune and identify sets of use cases where each data store fits best. Our goal became to build a centralized system which was able to make these decisions on the fly at query time and also take care of the end to end query execution. The system needed to take in all the available heuristics, apply any constraints already defined in the system, and select the best data store to run the query. It then would need to generate the underlying queries and pass all available information on to the query execution layer in order to facilitate further optimization at that layer.

Key Features!

  • Configuration driven API making it easy to address multiple reporting use cases
  • Define cubes across multiple data sources (oracle, druid, hive)
  • Dynamic selection of query data source based on query cost, grain, weight
  • Dynamic query generation with support for filter and ordering on every column, pagination, star schema joins, query type etc
  • Pluggable partitioning scheme, and time providers
  • Access control based on schema/labeling in cube definitions
  • Define constraints on max lookback, max days window in cube definitions
  • Provide easy aliasing of physical column names across tables / engines
  • Query execution for Oracle, Druid out-of-the-box
  • Support for dim driven queries for entity management alongside metrics
  • API side joins between Oracle/Druid for fact driven or dim driven queries
  • Fault tolerant apis: fall back option to other datasource if configured
  • Supports customizing and tweaking data-source-specific executor configs
  • MahaRequestLog : Kafka logging of API Statistics
  • Support for high cardinality dimension druid lookups
  • Standard JDBC driver to query maha (With Maha Dialect) powered by Avatica and Calcite.

Maha Architecture


Modules in maha

  • maha-core : responsible for creating the Reporting Request, Request Model (Query Metadata), Query Generation, and Query Pipeline (Engine selection)
  • maha-druid-executor : Druid Query Executor
  • maha-oracle-executor : Oracle Query Executor
  • maha-presto-executor : Presto Query Executor
  • maha-postgres-executor : Postgres Query Executor
  • maha-druid-lookups: Druid Lookup extension for high cardinality dimension druid lookups
  • maha-par-request: Library for Parallel Execution, Blocking and Non Blocking Callables using Java utils
  • maha-service : One json config for creating different registries using the fact and dim definitions.
  • maha-api-jersey : Easy war file helper library for exposing the api using maha-service module
  • maha-api-example : End to end example implementation of maha apis
  • maha-par-request-2: Library for Parallel Execution, Blocking and Non Blocking Callables using Scala utils
  • maha-request-log: Kafka Events writer about the api usage request stats for given registry in maha

Getting Started

Installing Maha API Library

<dependency>
  <groupId>com.yahoo.maha</groupId>
  <artifactId>maha-api-jersey</artifactId>
  <version>6.53</version>
</dependency>
  • maha-api-jersey includes all the dependencies of other modules
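If you build with sbt instead of Maven, the equivalent dependency coordinate would look roughly like the following (same groupId, artifactId and version as the Maven snippet above; double-check the latest released version before using it):

// build.sbt -- sbt equivalent of the Maven dependency above
libraryDependencies += "com.yahoo.maha" % "maha-api-jersey" % "6.53"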

Example Implementation of Maha Apis

  • Maha-Service Examples
    • Druid Wiki Ticker Example
    • H2 Database Student Course Example
      • you can run it locally as a unit test

Druid Wiki Ticker Example

For this example, you need a Druid instance running locally with the wikiticker dataset indexed into it; please take a look at http://druid.io/docs/latest/tutorials/quickstart.html

Creating Fact Definition for Druid Wikiticker

      ColumnContext.withColumnContext { implicit dc: ColumnContext =>
        Fact.newFact(
          "wikipedia", DailyGrain, DruidEngine, Set(WikiSchema),
          Set(
            DimCol("channel", StrType())
            , DimCol("cityName", StrType())
            , DimCol("comment", StrType(), annotations = Set(EscapingRequired))
            , DimCol("countryIsoCode", StrType(10))
            , DimCol("countryName", StrType(100))
            , DimCol("isAnonymous", StrType(5))
            , DimCol("isMinor", StrType(5))
            , DimCol("isNew", StrType(5))
            , DimCol("isRobot", StrType(5))
            , DimCol("isUnpatrolled", StrType(5))
            , DimCol("metroCode", StrType(100))
            , DimCol("namespace", StrType(100, (Map("Main" -> "Main Namespace", "User" -> "User Namespace", "Category" -> "Category Namespace", "User Talk" -> "User Talk Namespace"), "Unknown Namespace")))
            , DimCol("page", StrType(100))
            , DimCol("regionIsoCode", StrType(10))
            , DimCol("regionName", StrType(200))
            , DimCol("user", StrType(200))
          ),
          Set(
            FactCol("count", IntType())
            , FactCol("added", IntType())
            , FactCol("deleted", IntType())
            , FactCol("delta", IntType())
            , FactCol("user_unique", IntType())
            , DruidDerFactCol("Delta Percentage", DecType(10, 8), "{delta} * 100 / {count} ")
          )
        )
      }
        .toPublicFact("wikiticker_stats",
          Set(
            PubCol("channel", "Wiki Channel", InNotInEquality),
            PubCol("cityName", "City Name", InNotInEqualityLike),
            PubCol("countryIsoCode", "Country ISO Code", InNotInEqualityLike),
            PubCol("countryName", "Country Name", InNotInEqualityLike),
            PubCol("isAnonymous", "Is Anonymous", InNotInEquality),
            PubCol("isMinor", "Is Minor", InNotInEquality),
            PubCol("isNew", "Is New", InNotInEquality),
            PubCol("isRobot", "Is Robot", InNotInEquality),
            PubCol("isUnpatrolled", "Is Unpatrolled", InNotInEquality),
            PubCol("metroCode", "Metro Code", InNotInEquality),
            PubCol("namespace", "Namespace", InNotInEquality),
            PubCol("page", "Page", InNotInEquality),
            PubCol("regionIsoCode", "Region Iso Code", InNotInEquality),
            PubCol("regionName", "Region Name", InNotInEqualityLike),
            PubCol("user", "User", InNotInEquality)
          ),
          Set(
            PublicFactCol("count", "Total Count", InBetweenEquality),
            PublicFactCol("added", "Added Count", InBetweenEquality),
            PublicFactCol("deleted", "Deleted Count", InBetweenEquality),
            PublicFactCol("delta", "Delta Count", InBetweenEquality),
            PublicFactCol("user_unique", "Unique User Count", InBetweenEquality),
            PublicFactCol("Delta Percentage", "Delta Percentage", InBetweenEquality)
          ),
          Set.empty,
          getMaxDaysWindow, getMaxDaysLookBack
        )

Fact definition is the static object specification for the fact and dimension columns present in a table in the data source; you can think of it as the object image of the table. DimCol has the base name, data type, and annotations. Annotations are configurations stating the primary key/foreign key setup, special character escaping in query generation, or a static value mapping, e.g. StrType(100, (Map("Main" -> "Main Namespace", "User" -> "User Namespace", "Category" -> "Category Namespace", "User Talk"-> "User Talk Namespace"), "Unknown Namespace")). A fact definition can also have derived columns; maha supports the most common arithmetic derived expressions.

Public Fact : A public fact contains the base name to public name mapping. Public names can be used directly in the request JSON. A public fact is identified by its cube name, e.g. 'wikiticker_stats'. Maha supports versioning on cubes, so you can have multiple versions of the same cube.

Fact/Dimension Registration Factory: Facts and dimensions are registered inside static classes derived from FactRegistrationFactory or DimensionRegistrationFactory. These factory classes are referenced by name in maha-service-config.json.
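As a rough sketch, a registration factory is simply a class that maha-service instantiates by name from the config and then asks to register your cube definitions into a registry builder. The trait and method names below follow the pattern used in the maha-api-example module; treat the exact signatures and packages as assumptions and check that module for the real ones (imports are omitted here for that reason).

// Illustrative sketch only: signatures are assumed from the api-example pattern, not copied from the library.
class WikiFactRegistrationFactory extends FactRegistrationFactory {
  // Assumed hook: maha-service calls this with the RegistryBuilder it creates for the registry.
  override def register(registryBuilder: RegistryBuilder): Unit = {
    // wikitickerStats stands for the PublicFact built via .toPublicFact("wikiticker_stats", ...) above.
    registryBuilder.register(wikitickerStats)
  }
}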

maha-service-config.json

The Maha Service Config json is a one-place configuration for launching the maha apis; it includes the following:

  • Set of public facts registered under a registry name, e.g. the wikiticker_stats cube is registered under the registry name wiki
  • Set of registries
  • Set of query generators and their config
  • Set of query executors and their config
  • Bucketing configurations containing the cube-version-based routing of reporting requests
  • UTC time provider maps; if the date/time in a request is a local date, a UTC time provider converts it to UTC in the query generation phase
  • Parallel service executor maps for serving reporting requests using the configured thread pools
  • Maha request logging config: the Kafka configuration for logging maha request debug logs to a Kafka queue

To start with, we have created the api-jersey/src/test/resources/maha-service-config.json configuration; it is the maha api configuration for the student and wiki registries.
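A minimal sketch of the overall shape of that file is shown below. The top-level key names are only illustrative of the sections listed above; refer to the test configuration (and the JsonModels/Factories mentioned next) for the exact schema and factory class names.

{
  "registryMap": {
    "wiki": {
      "factRegistrationClass": "<your FactRegistrationFactory implementation>",
      "dimensionRegistrationClass": "<your DimensionRegistrationFactory implementation>"
    }
  },
  "executorMap": { },
  "generatorMap": { },
  "bucketingConfigMap": { },
  "utcTimeProviderMap": { },
  "parallelServiceExecutorConfigMap": { },
  "mahaRequestLoggingConfig": { }
}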

Debugging maha-service-config.json: for the configuration syntax of this json, take a look at the JsonModels/Factories in the service module. When Maha Service loads this configuration, any failures are returned as a list of FailedToConstructFactory / ServiceConfigurationError / JsonParseError.

Exposing the endpoints with api-jersey

Api-jersey uses the maha-service-config json and creates MahaResource beans. All you need to do is create the following three beans: 'mahaService', 'baseRequest', and 'exceptionHandler'.

    <bean id="mahaService" class="com.yahoo.maha.service.example.ExampleMahaService" factory-method="getMahaService"/>
    <bean id="baseRequest" class="com.yahoo.maha.service.example.ExampleRequest" factory-method="getRequest"/>
    <bean id="exceptionHandler" class="com.yahoo.maha.api.jersey.GenericExceptionMapper" scope="singleton" />
    <import resource="classpath:maha-jersey-context.xml" />

Once your application context is ready, you are good to launch the war file on the web server. You can take a look at the test application context we created for running the local demo and unit tests: api-jersey/src/test/resources/testapplicationContext.xml

Launch the maha api demo in local

Prerequisites:
  • druid.io getting started guide running locally for the wikiticker demo
  • Postman (optional)
Run the demo:
  • Step 1: Check out the yahoo/maha repository
  • Step 2: Run mvn clean install in maha
  • Step 3: cd into the api-example module and run mvn jetty:run; you can run it with -X for debug logs
  • Step 4: Step 3 launches a local jetty server and deploys the maha-api example war, and you are good to play with it!
Playing with the demo:
  • GET Domain request: dimensions and facts. You can fetch the wiki registry domain using curl http://localhost:8080/mahademo/registry/wiki/domain . The domain tells you the list of cubes and the corresponding list of fields that you can request for a particular registry. Here wiki is the registry name.

  • GET Flattened Domain request: flattened dimension and fact fields. You can get the flattened domain using curl http://localhost:8080/mahademo/registry/wiki/flattenDomain
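For convenience, here are the two GET calls from the bullets above:

# Domain (cubes and fields) for the wiki registry
curl http://localhost:8080/mahademo/registry/wiki/domain

# Flattened domain for the wiki registry
curl http://localhost:8080/mahademo/registry/wiki/flattenDomain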

  • POST Maha reporting request for the example student schema. The MahaRequest will look like the following; you need to pass the cube name, the list of fields you want to fetch, filters, sorting columns, etc.

{
   "cube": "student_performance",
   "selectFields": [
      {
         "field": "Student ID"
      },
      {
         "field": "Class ID"
      },
      {
         "field": "Section ID"
      },
      {
         "field": "Total Marks"
      }
   ],
   "filterExpressions": [
      {
         "field": "Day",
         "operator": "between",
         "from": "2017-10-20",
         "to": "2017-10-25"
      },
      {
         "field": "Student ID",
         "operator": "=",
         "value": "213"
      }
   ]
} 

You can find student.json in the api-example module. Make sure you change the dates to the latest date range, in YYYY-MM-dd format, to avoid the max look back window error.

Curl command :

curl -H "Content-Type: application/json" -H "Accept: application/json" -X POST -d @student.json http://localhost:8080/mahademo/registry/student/schemas/student/query?debug=true 

Sync Output :

{
	"header": {
		"cube": "student_performance",
		"fields": [{
				"fieldName": "Student ID",
				"fieldType": "DIM"
			},
			{
				"fieldName": "Class ID",
				"fieldType": "DIM"
			},
			{
				"fieldName": "Section ID",
				"fieldType": "DIM"
			},
			{
				"fieldName": "Total Marks",
				"fieldType": "FACT"
			}
		],
		"maxRows": 200
	},
	"rows": [
		[213, 200, 100, 125],
		[213, 198, 100, 120]
	]
}
  • POST Maha Reporting Request for example wiki schema

Request :

{
   "cube": "wikiticker_stats",
   "selectFields": [
      {
         "field": "Wiki Channel"
      },
      {
         "field": "Total Count"
      },
      {
         "field": "Added Count"
      },
      {
         "field": "Deleted Count"
      }
   ],
   "filterExpressions": [
      {
         "field": "Day",
         "operator": "between",
         "from": "2015-09-11",
         "to": "2015-09-13"
      }
   ]
}     

Curl :

      curl -H "Content-Type: application/json" -H "Accept: application/json" -X POST -d @wikiticker.json http://localhost:8080/mahademo/registry/wiki/schemas/wiki/query?debug=true  

Output :

{"header":{"cube":"wikiticker_stats","fields":[{"fieldName":"Wiki Channel","fieldType":"DIM"},{"fieldName":"Total Count","fieldType":"FACT"},{"fieldName":"Added Count","fieldType":"FACT"},{"fieldName":"Deleted Count","fieldType":"FACT"}],"maxRows":200},"rows":[["#ar.wikipedia",423,153605,2727],["#be.wikipedia",33,46815,1235],["#bg.wikipedia",75,41674,528],["#ca.wikipedia",478,112482,1651],["#ce.wikipedia",60,83925,135],["#cs.wikipedia",222,132768,1443],["#da.wikipedia",96,44879,1097],["#de.wikipedia",2523,522625,35407],["#el.wikipedia",251,31400,9530],["#en.wikipedia",11549,3045299,176483],["#eo.wikipedia",22,13539,2],["#es.wikipedia",1256,634670,15983],["#et.wikipedia",52,2758,483],["#eu.wikipedia",13,6690,43],["#fa.wikipedia",219,74733,2798],["#fi.wikipedia",244,54810,2590],["#fr.wikipedia",2099,642555,22487],["#gl.wikipedia",65,12483,526],["#he.wikipedia",246,51302,3533],["#hi.wikipedia",19,34977,60],["#hr.wikipedia",22,25956,204],["#hu.wikipedia",289,166101,2077],["#hy.wikipedia",153,39099,4230],["#id.wikipedia",110,119317,2245],["#it.wikipedia",1383,711011,12579],["#ja.wikipedia",749,317242,21380],["#kk.wikipedia",9,1316,31],["#ko.wikipedia",533,66075,6281],["#la.wikipedia",33,4478,1542],["#lt.wikipedia",20,14866,242],["#min.wikipedia",1,2,0],["#ms.wikipedia",11,21686,556],["#nl.wikipedia",445,145634,6557],["#nn.wikipedia",26,33745,0],["#no.wikipedia",169,51385,1146],["#pl.wikipedia",565,138931,8459],["#pt.wikipedia",472,229144,8444],["#ro.wikipedia",76,28892,1224],["#ru.wikipedia",1386,640698,19612],["#sh.wikipedia",14,6935,2],["#simple.wikipedia",39,43018,546],["#sk.wikipedia",33,12188,72],["#sl.wikipedia",21,3624,266],["#sr.wikipedia",168,72992,2349],["#sv.wikipedia",244,42145,3116],["#tr.wikipedia",208,67193,1126],["#uk.wikipedia",263,137420,1959],["#uz.wikipedia",983,13486,8],["#vi.wikipedia",9747,295972,1388],["#war.wikipedia",1,0,0],["#zh.wikipedia",1126,191033,7916]]}
  • POST Maha reporting request for the example student schema with the TimeShift curator. The MahaRequest will look like the following; you need to pass the cube name, the list of fields you want to fetch, filters, and sorting columns in the base request, plus the timeshift curator config (daysOffset is a day offset for requesting the previous period's to and from dates).
{
 "cube": "student_performance",
 "selectFields": [
    {
       "field": "Student ID"
    },
    {
       "field": "Class ID"
    },
    {
       "field": "Section ID"
    },
    {
       "field": "Total Marks"
    }
 ],
 "filterExpressions": [
    {
       "field": "Day",
       "operator": "between",
       "from": "2019-10-20",
       "to": "2019-10-29"
    },
    {
       "field": "Student ID",
       "operator": "=",
       "value": "213"
    }
 ],
"curators": {
  "timeshift": {
    "config" : {
      "daysOffset": 0 
    }
  }
}
}    

Please note that we have loaded the test data for the demo on the current day and the day before. For the timeshift curator demo, we have also loaded data for 11 days before the current date. Please make sure that you update the requested to and from dates according to the current date.

Curl command :

curl -H "Content-Type: application/json" -H "Accept: application/json" -X POST -d @student.json http://localhost:8080/mahademo/registry/student/schemas/student/query?debug=true 

Sync Output :

{
    "header": {
        "cube": "student_performance",
        "fields": [
            {
                "fieldName": "Student ID",
                "fieldType": "DIM"
            },
            {
                "fieldName": "Class ID",
                "fieldType": "DIM"
            },
            {
                "fieldName": "Section ID",
                "fieldType": "DIM"
            },
            {
                "fieldName": "Total Marks",
                "fieldType": "FACT"
            },
            {
                "fieldName": "Total Marks Prev",
                "fieldType": "FACT"
            },
            {
                "fieldName": "Total Marks Pct Change",
                "fieldType": "FACT"
            }
        ],
        "maxRows": 200,
        "debug": {}
    },
    "rows": [
        [
            213,
            198,
            100,
            120,
            98,
            22.45
        ],
        [
            213,
            200,
            100,
            125,
            110,
            13.64
        ]
    ]
}
  • POST Maha Reporting Request for example wiki schema with Total metrics curator

Request :

{
   "cube": "wikiticker_stats",
   "selectFields": [
      {
         "field": "Wiki Channel"
      },
      {
         "field": "Total Count"
      },
      {
         "field": "Added Count"
      },
      {
         "field": "Deleted Count"
      }
   ],
   "filterExpressions": [
      {
         "field": "Day",
         "operator": "between",
         "from": "2015-09-11",
         "to": "2015-09-13"
      }
   ],
   "curators": {
      "totalmetrics": {
         "config": {}
      }
   }
}

In the Druid quickstart tutorial, the wikipedia data is loaded for 2015-09-12, so there is no change to the requested dates here.

Curl :

      curl -H "Content-Type: application/json" -H "Accept: application/json" -X POST -d @wikiticker.json http://localhost:8080/mahademo/registry/wiki/schemas/wiki/query?debug=true  

Output :

{
    "header": {
        "cube": "wikiticker_stats",
        "fields": [
            {
                "fieldName": "Wiki Channel",
                "fieldType": "DIM"
            },
            {
                "fieldName": "Total Count",
                "fieldType": "FACT"
            },
            {
                "fieldName": "Added Count",
                "fieldType": "FACT"
            },
            {
                "fieldName": "Deleted Count",
                "fieldType": "FACT"
            }
        ],
        "maxRows": 200,
        "debug": {}
    },
    "rows": [
        [
            "#ar.wikipedia",
            0,
            153605,
            2727
        ],
        [
            "#be.wikipedia",
            0,
            46815,
            1235
        ],
        [
            "#bg.wikipedia",
            0,
            41674,
            528
        ],
        [
            "#ca.wikipedia",
            0,
            112482,
            1651
        ],
        ... trimming other rows 
    ],
    "curators": {
        "totalmetrics": {
            "result": {
                "header": {
                    "cube": "wikiticker_stats",
                    "fields": [
                        {
                            "fieldName": "Total Count",
                            "fieldType": "FACT"
                        },
                        {
                            "fieldName": "Added Count",
                            "fieldType": "FACT"
                        },
                        {
                            "fieldName": "Deleted Count",
                            "fieldType": "FACT"
                        }
                    ],
                    "maxRows": -1,
                    "debug": {}
                },
                "rows": [
                    [
                        0,
                        9385573,
                        394298
                    ]
                ]
            }
        }
    }
}

Maha JDBC Query Layer (Example dbeaver configuration)

Maha is currently queryable via JSON REST APIs. We have also exposed a standard JDBC interface to query maha, so users can use any tool they like, such as SQL Lab, DBeaver, or any other database IDE.
Users are agnostic to which engine the maha SQL query fetches the data from and get the data back seamlessly, without any code change on the client side.
This feature is powered by Apache Calcite for SQL parsing and Avatica for exposing the JDBC server.


You can follow the steps below to configure your local explorer and query maha over JDBC.

  1. Follow the steps above and keep your api-example server running. It exposes the endpoint http://localhost:8080/mahademo/registry/student/schemas/student/sql-avatica to be used by the Avatica JDBC connection.
  2. Optionally, you can run docker run -p 8080:8080 -it pranavbhole/pbs-docker-images:maha-api-example, which starts the maha-example-api server locally; in that case you can skip step 1.
  3. Download the community version of DBeaver from https://dbeaver.io/
  4. Go to Driver Manager and configure the Avatica jar with the following settings:
JDBC URL =  jdbc:avatica:remote:url=http://localhost:8080/mahademo/registry/student/schemas/student/sql-avatica
Driver Class Name =  org.apache.calcite.avatica.remote.Driver
  5. The Avatica driver is mostly backward compatible; we used https://mvnrepository.com/artifact/org.apache.calcite.avatica/avatica-core/1.17.0 for the demo.
  6. Example queries:

DESCRIBE student_performance;

SELECT 'Student ID', 'Total Marks', 'Student Name', 'Student Status' ,'Admitted Year',
 'Class ID' FROM student_performance where 'Student ID' = 213
 ORDER BY 'Total Marks' DESC;
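Beyond DBeaver, any JDBC client can talk to the same endpoint. Below is a minimal Scala sketch using the Avatica remote driver class and JDBC URL shown above; it assumes avatica-core (e.g. the 1.17.0 artifact used in the demo) is on the classpath and that the api-example server is running locally.

import java.sql.DriverManager

object MahaJdbcExample {
  def main(args: Array[String]): Unit = {
    // Driver class and URL are the ones documented above for the local demo.
    Class.forName("org.apache.calcite.avatica.remote.Driver")
    val url = "jdbc:avatica:remote:url=http://localhost:8080/mahademo/registry/student/schemas/student/sql-avatica"
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      // Same style of query as the example above, against the student_performance cube.
      val rs = stmt.executeQuery("SELECT 'Student ID', 'Total Marks' FROM student_performance WHERE 'Student ID' = 213")
      while (rs.next()) {
        println(s"${rs.getString(1)} -> ${rs.getString(2)}")
      }
    } finally {
      conn.close()
    }
  }
}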


Presentation of 'Maha' at Bay Area Hadoop Meetup held on 29th Oct 2019:

'Maha' at Bay Area Hadoop Meetup held on 29th Oct 2019

Contributions

  • Hiral Patel
  • Pavan Arakere Badarinath
  • Pranav Anil Bhole
  • Shravana Krishnamurthy
  • Jian Shen
  • Shengyao Qian
  • Ryan Wagner
  • Raghu Kumar
  • Hao Wang
  • Surabhi Pandit
  • Parveen Kumar
  • Santhosh Joshi
  • Vivek Chauhan
  • Ravi Chotrani
  • Huiliang Zhang
  • Abhishek Sarangan
  • Jay Yang
  • Ritvik Jaiswal
  • Ashwin Tumma
  • Ann Therese Babu
  • Kevin Chen
  • Priyanka Dadlani

Acknowledgements

  • Oracle Query Optimizations
    • Remesh Balakrishnan
    • Vikas Khanna
  • Druid Query Optimizations
    • Eric Tschetter
    • Himanshu Gupta
    • Gian Merlino
    • Fangjin Yang
  • Hive Query Optimizations
    • Seshasai Kuchimanchi


maha's Issues

Druid filter on facts not included in the list of request columns

We have a Druid query use case where we want to filter out rows where an aggregated metric is less than a specified threshold, but don't want that metric's column included in the list of output columns.

This is possible with dimension columns but not with fact columns. Only fact columns in the list of request columns are added to the list of fact aggregators.

    val factCols = queryContext.factBestCandidate.factColMapping.toList.collect {
      case (nonFkCol, alias) if queryContext.factBestCandidate.requestCols(nonFkCol) =>
        (fact.columnsByNameMap(nonFkCol), alias)
    }

https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L1143

Trying to understand why we need the queryContext.factBestCandidate.requestCols(nonFkCol) check. Would there be any issues if we removed it in order to support filtering on non output fact columns?

Maha Validate query request before actually query

Hi,
We are making an API endpoint which checks whether a request query to maha is valid or not. The query looks like this:

{
  "cube": "fake_cube",
  "selectFields": [
    {
      "field": "fake ID"
    }
  ],
  "filterExpressions": [
    {
      fake filter
    }
  ],
  "sortBy": [
    {
      "field": "fake Id",
      "order": "Desc"
    }
  ],
  "rowsPerPage": 10
}

We want to check whether the fields, cube, filters, etc. are valid. Is there any method that can validate the request before processing the query?
Thanks!

MacOS Big Sur breaks PostgreSQL installations causing maha failure

This is a small tip that may be helpful if someone upgrades macOS from Catalina to Big Sur in the future.

Issue

Big Sur may break PostgreSQL installations, which will cause mvn clean install to fail in maha with the number of failed tests showing 0.

The following is the error when running mvn clean install -X:
Run completed in 17 seconds, 558 milliseconds.
Total number of tests run: 1346
Suites: completed 79, aborted 2
Tests: succeeded 1346, failed 0, canceled 0, ignored 4, pending 0
*** 2 SUITES ABORTED ***
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for maha parent 6.37-SNAPSHOT:
[INFO]
[INFO] maha parent ........................................ SUCCESS [ 3.482 s]
[INFO] maha-par-request-2 ................................. SUCCESS [ 57.452 s]
[INFO] maha db ............................................ SUCCESS [ 40.056 s]
[INFO] maha request log ................................... SUCCESS [ 33.018 s]
[INFO] maha-druid-lookups ................................. SUCCESS [ 48.361 s]
[INFO] maha core .......................................... FAILURE [01:27 min]
[INFO] maha job service ................................... SKIPPED
[INFO] maha druid executor ................................ SKIPPED
[INFO] maha oracle executor ............................... SKIPPED
[INFO] maha presto executor ............................... SKIPPED
[INFO] maha postgres executor ............................. SKIPPED
[INFO] maha service ....................................... SKIPPED
[INFO] maha worker ........................................ SKIPPED
[INFO] maha api-jersey .................................... SKIPPED
[INFO] maha api-example ................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:30 min
[INFO] Finished at: 2021-03-02T11:03:49-08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (test) on project maha-core: There are test failures -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (test) on project maha-core: There are test failures
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures
    at org.scalatest.tools.maven.TestMojo.execute (TestMojo.java:107)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :maha-core

On further checking, the failure happens in PostgresQueryGeneratorTest at the following step:
https://github.com/yahoo/maha/blob/master/core/src/test/scala/com/yahoo/maha/core/query/postgres/PostgresQueryGeneratorTest.scala#L38

Solution

Running brew install postgresql will resolve the issue

Reference

opentable/otj-pg-embedded#136

Rollup Best Candidate Calc Inconsistent with Equal Cost

Found that given two rollups:
(dr_stats_hourly,Druid,1600,9993,8675309)

(dr_teacher_stats_hourly,Druid,1600,9992,8675309)

With same cost (1600) and Engine (Druid), Level is not factored into old logic:
if (a._2 == b._2) {
  if (a._4 == b._4) { a._3 < b._3 } else { a._4 < b._4 }
} else {
  if (a._5 == b._5) { a._3 < b._3 } else { a._5 < b._5 }
}

This case drops into
if (a._5 == b._5) { a._3 < b._3 }

It is this if-clause which causes the errors.

Adding in the same check made on the IF case fixes this.

#765

Aggregate Dim Col Feature

The requirement is that we want to query the count of section_ids that a student belongs to. Currently maha does not support rollup expressions on the dimension columns in the fact.

Example query is:

select student_id, count(distinct section_id) as NumberOfSections
from student_perf
group by student_id

The plan is to create a new derived column called DerAggregatedDimCol which will have a rollup expression along with the derived expression, and the respective engine query generator will take care of rendering it. The tricky part is to exclude it from the group by expressions only if the base col is not used by any other requested cols or dependent derived cols.
As this is an experimental feature, I am planning to start with the hive/presto query generators.

Let me know if you have any suggestions or questions.

using druid maha lookups as a replacement for lookups-cached-global

Hi,
We are currently using the lookups-cached-global extension for loading lookups in Druid (version 0.12.3). We load lookups from different MSSQL and MySQL servers. We load around 50-100 lookups, of which the top 10 have around 10-15 million entries. Because of the huge size of these lookups we are having a lot of issues (high GC pauses, not being able to query) while loading lookups on historicals and brokers, so I would like to use your extension as a replacement for lookups-cached-global.
Are there any queries that could be affected?
Do you support extracting lookups from MySQL servers?

Make the RocksDB druid lookup serialization generic

Right now we are explicitly using Protocol Buffers for the serialization of in-memory druid lookups, which restricts us from using different serialization formats. Create a generic interface so we can have multiple implementations.

[Feature] Dim Only Query: Schema required filter validation in request

Right now, the request model is not validating the schema required filters for a Dim Only Query.
The code block that validates the schema required filters is only applied when we have a fact best candidate defined. This change adds the missing feature.

  • If you are querying dim only data, then you need to have the schema required fields as filters.

  • If you are querying a top level dim like ads or ad assets, then it will search all the related dimensions, create a flat map of schema required fields, and assert on it. (Had to do this because in some cases we specify the schema required fields only in the lower level dim.)

  • Also asserting that a schema required filter can only use an IN or = filter to pass the criteria, so that no request can run a NOT IN filter on Advertiser ID and get all the data.

  • Exposing common methods in the registry.

Combining data from datasources based on date range

Hi all,
I have a use case where we'd like to combine data from separate datasources based on the time range of the query.
We have a datasource with realtime data (latest 24 hrs) and a separate one with historical data (older than 24 hrs).

Is it possible to set this up in a way that, if a query comes in for the last 14 days, Maha would query the realtime data for the latest day and the historical data for the rest of the time range, and then combine those results?

Similarly along that line, the decision to choose the preferred dataset could be based on 'data availability' and not just on strict date ranges.

Support for really large queries

Hi,
We are running into OOM errors for large druid queries we run in Maha. The response sizes are huge, upwards of 10GB to even 200GB in some cases. It seems that the Druid response is being loaded completely in memory.
Is there a way to use streaming, or to spill these large results to disk?

Support for a doubleSum aggregator in Druid

Getting the following error when defining a Fact col that is a DecType
FactCol("revenue", DecType(10,8))

"errorMessage" : "Could not resolve type id 'roundingDoubleSum' into a subtype of [simple type, class org.apache.druid.query.aggregation.AggregatorFactory]: known type ids = [AggregatorFactory, HLLSketchBuild, HLLSketchMerge, approxHistogram, approxHistogramFold, arrayOfDoublesSketch, cardinality, count, doubleFirst, doubleLast, doubleMax, doubleMin, doubleSum, filtered, floatFirst, floatLast, floatMax, floatMin, floatSum, histogram, hyperUnique, javascript, longFirst, longLast, longMax, longMin, longSum, quantilesDoublesSketch, quantilesDoublesSketchMerge, sketchBuild, sketchMerge, stringFirst, stringFirstFold, stringLast, stringLastFold, thetaSketch]\n at [Source: HttpInputOverHTTP@23997a31[c=1080,q=0,[0]=null,s=STREAM]; line: 1, column: 692] (through reference chain: org.apache.druid.query.groupby.GroupByQuery["aggregations"]->java.util.ArrayList[0])",

[WIP] Dynamically generated versioned schema

One of the limitations of maha is how we expect to define the schema. This is primarily due to how we defined the same in the older versions of maha which were not open sourced. The schema was predefined and used at runtime. In order to add a new column or derived column, a code change was required. The pro of this is it goes through code review and testing before pushing it to production. The con is every change has to go through the same process regardless of how simple the change may be; e.g. something as simple as adding a new metric column with no specializations. Another limitation of maha is versioning of the schema: a request may be validated and queued using a different version of the schema than when a worker picks up the request for processing in the asynchronous processing use case. This could result in failures if schema changes are not backward compatible (e.g. removal of a column).

One approach to solving both of these problems would be to generate the schema dynamically on start up. This could introduce a little latency to the start up process but would reduce the burden of adding new changes without having to make code changes. However, we would need to version the changes and support rollback to a prior version. In some circumstances, the change may be complex enough to warrant a predefined hook to be invoked on the builder after dynamically constructing the schema.

A schema may be dynamically constructed using these versioned configurations:

  1. Database Schema Dump
  2. Overrides and derived column configurations on top of table definitions
  3. Predefined hook to be invoked on the builder

In order to support versioning and publishing of updates with rollback we'd need to support the following:

  1. Versioned schema dump, overrides and derived column definitions, and predefined hook libs
  2. Ability to define the current version / latest published version (a versioned triple of the 3 configurations above)
  3. Ability to export a version such that it can be imported into a downstream environment (e.g. staging -> production)
  4. Attach validation tests for each versioned export so they can be used for validation after import

In order to support the above, we'd need the following:

  1. Define a data model that supports the above requirements
  2. API to allow CRUD on the data model (dev/staging environment)
  3. API to support export/import/validation workflow (dev/staging/prod environment)

Each environment would have its own independently managed versioned triples. The individual components may be versioned in dev environment and subsequently pushed to staging and prod.

WIP

Integration issue with finatra at runtime

Seeing this runtime exception when I pull in maha artifacts.
Finatra uses netty 4.1.35.Final but somehow this class is getting pulled into the classpath from ning async-http-client as it pulls an older version of the DefaultChannelId class inside the jar.
impossibl/pgjdbc-ng#332

Caused by: java.lang.NoSuchMethodError: io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:112)
at io.netty.channel.AbstractChannel.(AbstractChannel.java:84)
at io.netty.bootstrap.FailedChannel.(FailedChannel.java:33)
at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:330)
at io.netty.bootstrap.AbstractBootstrap.doBind(AbstractBootstrap.java:282)
at io.netty.bootstrap.AbstractBootstrap.bind(AbstractBootstrap.java:278)
at com.twitter.finagle.netty4.ListeningServerBuilder$$anon$1.(ListeningServerBuilder.scala:172)
at com.twitter.finagle.netty4.ListeningServerBuilder.bindWithBridge(ListeningServerBuilder.scala:78)
at com.twitter.finagle.netty4.Netty4Listener.listen(Netty4Listener.scala:103)
at com.twitter.finagle.server.StdStackServer.newListeningServer(StdStackServer.scala:82)
at com.twitter.finagle.server.StdStackServer.newListeningServer$(StdStackServer.scala:69)
com.twitter.inject.server.EmbeddedTwitterServer.$anonfun$runNonExitingMain$2(EmbeddedTwitterServer.scala:623)
... 26 more

newRollUp and createSubset need to apply costMultiplier updates

newRollUp has an input field called costMultiplier, but the logic to apply this to its generated rollup isn't present. Need to create logic to do this.
Following:

def newRollUp(...
                , costMultiplier: Option[BigDecimal] = None
                ...
               ) : FactBuilder = {

Applies as such:

tableMap = tableMap +
        (name -> new FactTable(
          ...
          , fromTable.costMultiplierMap
          ...
      ))
      this

And should include this extra logic:

val remappedMultiplier: Map[RequestType, CostMultiplier] =
        fromTable.costMultiplierMap.map(f => (
          f._1 -> {
            val adjustedLRL: LongRangeLookup[BigDecimal] = LongRangeLookup(f._2.rows.list.map(row => (row._1, row._2 * costMultiplier.getOrElse(1))))
            CostMultiplier(adjustedLRL)
          }
          )
        )

Or similar.

MIP 1: Generate the join instead of in subquery when high cardinality dimension filters are requested

We found that high cardinality filters work better in join clauses than in fact-only subquery clauses.

Right now we are using highCardinalityFilters only to disqualify the druid engine. It would be a good use to also optimize the query based on the HighCardinalityFilters.

For example, the following join query performs better than the IN subquery when hcf is a high cardinality filter:

select * from fact
left outer join (select id from dim where hcf in ('A')) dm on (fact.id = dm.id);

versus:

select * from fact
where id in (select id from dim where hcf in ('A'));

Reduce JDBC Lookups for Oracle

Currently, JDBC lookups open a very large number of connections per node defined to make them, such that as the number of query nodes increases, the number of connections increases linearly.

This proposal is to create a new JDBC-based lookup type that offloads the inMemory data fetching into only a few LEADER nodes (3, for example) and delegates all other nodes into data reading (FOLLOWER).

Current architecture of JDBC Lookups:

  1. Create a thread, on which a periodic Job is launched (once per minute, for example)
  2. On that thread, scan a table for all data newer than current lastUpdated date.
  3. For all data, save it by key->value pair into an in memory cache (JDBC)

New architecture of JDBC Lookups (Borrows Kafka writing logic from MissingLookupManager):

LEADER LOGIC

  1. Create a thread, on which a periodic Job is launched (once per minute, for example)
  2. On that thread, scan a table for all data newer than current lastUpdated date.
  3. For all data, save it by key->value pair into a map
  4. By the same logic as MissingLookupManager, populate a Kafka Topic (KafkaProducer using Protobuf) with the map contents.

FOLLOWER LOGIC

  1. Create a thread, on which a KafkaConsumer is launched.
  2. On that thread, constantly poll the above Kafka Topic on the assigned node pool for updates.
  3. Use the same Protobuf as was used to populate the topic to read back into an in memory cache (JDBC).
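For illustration only, a minimal sketch of the proposed FOLLOWER loop using the plain kafka-clients consumer API might look like the code below. The broker address, group id, topic name, and cache are placeholders, and the protobuf decode step is elided; none of these are existing maha classes.

import java.time.Duration
import java.util.{Collections, Properties}
import java.util.concurrent.ConcurrentHashMap
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer

object LookupFollowerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")          // placeholder broker
    props.put("group.id", "maha-lookup-follower")             // placeholder consumer group
    props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
    props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

    val cache = new ConcurrentHashMap[String, Array[Byte]]()  // stand-in for the in-memory lookup cache
    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("maha-lookup-updates")) // placeholder topic populated by the LEADER

    while (true) {
      // Constantly poll the topic for updates published by the LEADER nodes.
      val records = consumer.poll(Duration.ofSeconds(1))
      records.forEach { record =>
        // In the real proposal the value would be decoded with the same protobuf schema the LEADER used.
        cache.put(new String(record.key()), record.value())
      }
    }
  }
}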

Missing DefaultDimensionSpec in Druid query when querying with multiple Lookup dimensions.

When we query Maha with multiple lookup dimensions in the select fields we are seeing some of the fields returned as nulls in the results. The underlying druid query issued did not have these lookups listed in the default dimension specs.

To elaborate further, the issued maha query is of the following format.
Maha Query

{
  "cube": "test_cube",
  "rowsPerPage": 1000,
  "selectFields": [
    {
      "field": "Adunit ID"
    },
   {
      "field": "Adunit Name"  //Lookup based on Adunit ID
    },
    {
      "field": "Adgroup ID"
    },
   {
      "field": "Adgroup Name" //Lookup based on Adgroup ID
    },
    {
      "field": "Adserver Requests"
    }
  ],
  "filterExpressions": [
    {
      "field": "Publisher ID",
      "operator": "=",
      "value": "xxxxxxxxxAAAAAABBBBBBBBBB"
    },
    {
      "field": "Day",
      "operator": "Between",
      "from": "2020-09-17",
      "to": "2020-09-17"
    }
  ]
}

Output

{
  "header": {
    "cube": "test_cube",
    "fields": [
      {
        "fieldName": "Adunit ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adunit Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adserver Requests",
        "fieldType": "FACT"
      }
    ],
    "maxRows": 1000,
    "debug": {}
  },
  "rows": [
    [
      "dddddddddddAdunitI1dddddddddddddd",
      "dddddddddddAdunitI1 Name dddddddddddddd",
      null,   // _Missing adgroup_id_
      "dddddddddddAdgroup1 Name dddddddddddddd",
      0
    ],
    [
      "dddddddddddAdunitId2ddddddddddddd",
      "dddddddddddAdunitId2 Name ddddddddddddd",
      null,   // _Missing adgroup_id_
      "dddddddddddAdgroup2 Name dddddddddddddd",
      1
    ]
  ],
  "curators": {}
}

Druid Query Created by Maha

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "test_cube"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "2020-09-17T00:00:00.000Z/2020-09-18T00:00:00.000Z"
    ]
  },
  "virtualColumns": [],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "or",
        "fields": [
          {
            "type": "selector",
            "dimension": "__time",
            "value": "2020-09-17",
            "extractionFn": {
              "type": "timeFormat",
              "format": "YYYY-MM-dd",
              "timeZone": "UTC",
              "granularity": {
                "type": "none"
              },
              "asMillis": false
            }
          }
        ]
      },
      {
        "type": "selector",
        "dimension": "pubId",
        "value": "xxxxxxxxxAAAAAABBBBBBBBBB"
      }
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [     // No adgroup_id added here. 
    {
      "type": "default",
      "dimension": "adunitId",
      "outputName": "Adunit ID",
      "outputType": "STRING"
    },
    {
      "type": "extraction",
      "dimension": "adgroupId",
      "outputName": "Adgroup Name",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "adgroup_names",
        "retainMissingValue": false,
        "replaceMissingValueWith": "UNKNOWN"
      }
    },
    {
      "type": "extraction",
      "dimension": "adunitId",
      "outputName": "Adunit Name",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "adunit_names",
        "retainMissingValue": false,
        "replaceMissingValueWith": "UNKNOWN"
      }
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "Adserver Requests",
      "fieldName": "adserverRequests"
    }
  ],
  "postAggregations": [],
  "limitSpec": {
    "type": "default",
    "columns": [],
    "limit": 10000000
  },
  "context": {
    "groupByStrategy": "v2",
    "applyLimitPushDown": "false",
    "implyUser": "internal_user",
    "priority": 10,
    "userId": "internal_user",
    "uncoveredIntervalsLimit": 1,
    "groupByIsSingleThreaded": true,
    "timeout": 900000,
    "queryId": "9292389f-2a7f-4e12-a39a-6f727097ab92"
  },
  "descending": false
}

Upon debugging further, I stumbled upon the variable factRequestCols (a Set of Strings) at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L353, which is passed on to the method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381, where the druid query's dimension specs are created in the getDimensions method based on the factRequestCols passed in.

I am not quite sure I understand the logic of the factRequestCols set creation here, but the adgroup_id dimension is not getting included in the resulting set, because of which it is not getting added to the Druid query dimension spec either.

I locally overrode the code by passing queryContext.factBestCandidate.requestCols to the getDimensions method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381, and it fixed the issue: the dimension spec in the druid query is populated and the adgroup_id values appear in the resulting output.

Output after the change

{
  "header": {
    "cube": "platform_performance_cube",
    "fields": [
      {
        "fieldName": "Adunit ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adunit Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adserver Requests",
        "fieldType": "FACT"
      }
    ],
    "maxRows": 1000,
    "debug": {}
  },
  "rows": [
    [
      "dddddddddddAdunitI1dddddddddddddd",
      "dddddddddddAdunitI1 Name dddddddddddddd",
      "dddddddddddAdgroup1dddddddddddddd",
      "dddddddddddAdgroup1 Name dddddddddddddd",
      0
    ],
    [
      "dddddddddddAdunitId2ddddddddddddd",
      "dddddddddddAdunitId2 Name ddddddddddddd",
      "dddddddddddAdgroup2dddddddddddddd",
      "dddddddddddAdgroup2 Name dddddddddddddd",
      1
    ]
  ],
  "curators": {}
}

Could you please help me with the context of the factRequestCols set creation, and also let me know if the logic of the factRequestCols set creation or anything else needs to be changed to include the missing dimensions?

Presto cast to double for integer division

Integer division returns an integer answer; anything after the decimal point is dropped. This is to rectify that behavior and make it similar to hive, which handles this internally.

Hive functional testing/ Query validation framework

I was exploring hive functional testing frameworks and found that Validatar is a Yahoo framework that fits maha's needs. We could give some thought to integrating this into maha itself, as it can work as a query validator. Currently we do not have unit tests for hive in maha that actually run the query on a hive data source. Validatar could work as a query syntax validator and provide a kind of functional testing for the hive query generator.

https://github.com/yahoo/validatar

MultiEngineQuery Collapses Non-Pk Fact Table Dimension Rows

This issue is illustrated here:
#447

It is most noticeable when the column is Mapped (Pricing Type, for example), which makes query behavior inconsistent with the Reporting Request.

Given a Dimension located in the fact with multiple expected useful output values, this dimension gets collapsed as it is not the most granular key alias on the table.
