
Apache Kylin v1.5.2 Release Announcement


The Apache Kylin community is pleased to announce the release of Apache Kylin v1.5.2.

Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets; it was originally contributed by eBay Inc.

To download Apache Kylin v1.5.2 source code or binary package:
please visit the download page.

This is a major release that brings a more stable, robust, and manageable version; the Apache Kylin community resolved about 76 issues, including bug fixes, improvements, and a few new features.

Change Highlights

New Feature

Improvement

  • Enhance mail notification KYLIN-869
  • HiveColumnCardinalityJob should use configurations in conf/kylin_job_conf.xml KYLIN-955
  • Enable deriving dimensions on non PK/FK KYLIN-1313
  • Improve performance of converting data to hfile KYLIN-1323
  • Tools to extract all cube/hybrid/project related metadata to facilitate diagnosing/debugging/sharing KYLIN-1340
  • change RealizationCapacity from three profiles to specific numbers KYLIN-1381
  • quicker and better response to v2 storage engine’s rpc timeout exception KYLIN-1391
  • Memory hungry cube should select LAYER and INMEM cubing smartly KYLIN-1418
  • For GUI, to add one option “yyyy-MM-dd HH:MM:ss” for Partition Date Column KYLIN-1432
  • cuboid sharding based on specific column KYLIN-1453
  • attach a hyperlink to introduce new aggregation group KYLIN-1487
  • Move query cache back to query controller level KYLIN-1526
  • Hfile owner is not hbase KYLIN-1542
  • Make hbase encoding and block size configurable just like hbase compression KYLIN-1544
  • Refactor storage engine(v2) to be extension friendly KYLIN-1561
  • Add and use a separate kylin_job_conf.xml for in-mem cubing KYLIN-1566
  • Front-end work for KYLIN-1557 KYLIN-1567
  • Coprocessor thread voluntarily stop itself when it reaches timeout KYLIN-1578
  • IT preparation classes like BuildCubeWithEngine should exit with status code upon build exception KYLIN-1579
  • Use 1 byte instead of 8 bytes as column indicator in fact distinct MR job KYLIN-1580
  • Specify region cut size in cubedesc and leave the RealizationCapacity in model as a hint KYLIN-1584
  • make MAX_HBASE_FUZZY_KEYS in GTScanRangePlanner configurable KYLIN-1585
  • show cube level configuration overwrites properties in CubeDesigner KYLIN-1587
  • enabling different block size setting for small column families KYLIN-1591
  • Add “isShardBy” flag in rowkey panel KYLIN-1599
  • Need not to shrink scan cache when hbase rows can be large KYLIN-1601
  • User could dump hbase usage for diagnosis KYLIN-1602
  • Bring more information in diagnosis tool KYLIN-1614
  • Use deflate level 1 to enable compression “on the fly” KYLIN-1621
  • Make the hll precision for data samping configurable KYLIN-1623
  • HyperLogLogPlusCounter will become inaccurate when there’re billions of entries KYLIN-1624
  • GC log overwrites old one after restart Kylin service KYLIN-1625
  • add backdoor toggle to dump binary cube storage response for further analysis KYLIN-1627

Bug

  • column width is too narrow for timestamp field KYLIN-989
  • cube data not updated after purge KYLIN-1197
  • Can not get more than one system admin email in config KYLIN-1305
  • Should check and ensure TopN measure has two parameters specified KYLIN-1551
  • Unsafe check of initiated in HybridInstance#init() KYLIN-1563
  • Select any column when adding a custom aggregation in GUI KYLIN-1569
  • Unclosed ResultSet in QueryService#getMetadata() KYLIN-1574
  • NPE in Job engine when execute MR job KYLIN-1581
  • Agg group info will be blank when trying to edit cube KYLIN-1593
  • columns in metric could also be in filter/groupby KYLIN-1595
  • UT fail, due to String encoding CharsetEncoder mismatch KYLIN-1596
  • cannot run complete UT at windows dev machine KYLIN-1598
  • Concurrent write issue on hdfs when deploy coprocessor KYLIN-1604
  • Cube is ready but insight tables not result KYLIN-1612
  • UT ‘HiveCmdBuilderTest’ fail on ‘testBeeline’ KYLIN-1615
  • Can’t find any realization coursed by Top-N measure KYLIN-1619
  • sql not executed and report topN error KYLIN-1622
  • Web UI of TopN, “group by” column couldn’t be a dimension column KYLIN-1631
  • Unclosed OutputStream in SSHClient#scpFileToLocal() KYLIN-1634
  • Sample cube build error KYLIN-1637
  • Unclosed HBaseAdmin in ToolUtil#getHBaseMetaStoreId() KYLIN-1638
  • Wrong logging of JobID in MapReduceExecutable.java KYLIN-1639
  • Kylin’s hll counter count “NULL” as a value KYLIN-1643
  • Purge a cube, and then build again, the start date is not updated KYLIN-1647
  • java.io.IOException: Filesystem closed - in Cube Build Step 2 (MapR) KYLIN-1650
  • function name ‘getKylinPropertiesAsInputSteam’ misspelt KYLIN-1655
  • Streaming/kafka config not match with table name KYLIN-1660
  • tableName got truncated during request mapping for /tables/tableName KYLIN-1662
  • Should check project selection before add a stream table KYLIN-1666
  • Streaming table name should allow enter “DB.TABLE” format KYLIN-1667
  • make sure metadata in 1.5.2 compatible with 1.5.1 KYLIN-1673
  • MetaData clean just clean FINISHED and DISCARD jobs,but job correct status is SUCCEED KYLIN-1678
  • error happens while execute a sql contains ‘?’ using Statement KYLIN-1685
  • Illegal char on result dataset table KYLIN-1688
  • KylinConfigExt lost base properties when store into file KYLIN-1721
  • IntegerDimEnc serialization exception inside coprocessor KYLIN-1722

Upgrade

Data and metadata of this version are backward compatible with v1.5.1, but you may need to redeploy the HBase coprocessor.

Support

For any issue or question, please
open a JIRA ticket for the Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to the Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who contributed!



RAW measure in Apache Kylin


Introduction

The RAW measure function is used to query the detail data of a measure column in Kylin.

Example data:

| DT | SITE_ID | SELLER_ID | ITEM_COUNT |
| :--------: | :-----: | :--------: | :--------: |
| 2016-05-01 | 0 | SELLER-001 | 100 |
| 2016-05-01 | 0 | SELLER-002 | 200 |
| 2016-05-02 | 1 | SELLER-003 | 300 |
| 2016-05-02 | 1 | SELLER-004 | 400 |
| 2016-05-03 | 2 | SELLER-005 | 500 |

Suppose we design the cube with the DT and SITE_ID columns as dimensions and SUM(ITEM_COUNT) as the measure. The base cuboid data will then look like this:

| Rowkey of base cuboid | SUM(ITEM_COUNT) |
| :-------------------: | :-------------: |
| 2016-05-01_0 | 300 |
| 2016-05-02_1 | 700 |
| 2016-05-03_2 | 500 |

For the first row of the base cuboid data, Kylin can extract the dimension values 2016-05-01, 0 from the HBase rowkey, but the measure cell stores only the aggregated result 300; the raw values 100 and 200 of the ITEM_COUNT column before aggregation cannot be recovered.

The RAW function makes the SQL:

SELECT DT,SITE_ID,ITEM_COUNT FROM FACT_TABLE

to return the correct result:

| DT | SITE_ID | ITEM_COUNT |
| :--------: | :-----: | :--------: |
| 2016-05-01 | 0 | 100 |
| 2016-05-01 | 0 | 200 |
| 2016-05-02 | 1 | 300 |
| 2016-05-02 | 1 | 400 |
| 2016-05-03 | 2 | 500 |

How to use

  • Use Kylin version 1.5.1 or later.
  • As in the case above, define DT and SITE_ID as dimensions, and RAW(ITEM_COUNT) as a measure.
  • After the cube is built, you can query the raw data with SQL:
SELECT DT,SITE_ID,ITEM_COUNT FROM FACT_TABLE WHERE SITE_ID = 0

Optimize

The column on which a RAW measure is defined is encoded with a dictionary by default, so you should know your data's cardinality and distribution characteristics.

  • Where possible, choose uniformly distributed columns as dimensions; this keeps the measure cell sizes more uniform and avoids data skew.
  • If you define a RAW measure on an ultra-high-cardinality column, try the following to avoid dictionary build errors:
    1. Split a big segment into several smaller segments, if you were trying to build a large data set at once;
    2. Set kylin.dictionary.max.cardinality in conf/kylin.properties to a bigger value (default is 5000000).
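For instance, raising the dictionary limit in conf/kylin.properties might look like this (10000000 is an arbitrary example value):

```
# default is 5000000; raise only if the RAW column truly needs it
kylin.dictionary.max.cardinality=10000000
```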

To be improved

  • Currently, at most 1M values of a RAW measure can be stored in one cuboid; exceeding this throws a BufferOverflowException during cube build. This will be optimized in a later release.
  • Only dimension columns can be used in the WHERE condition; filtering on a RAW measure column is not supported.

Implement

  • A custom aggregation function RAW is implemented; its return type depends on the column type.
  • The RAW aggregation function saves the column's raw data in the base cuboid data.
  • The HBase value cell stores the dictionary ids of the raw data, to save space.
  • SQL that contains a RAW measure column is routed to the base cuboid.
  • At query time, the raw data is extracted from the base cuboid data and assembled with the dimension values into complete rows.

Deploy Apache Kylin with Standalone HBase Cluster

$
0
0

Introduction

Apache Kylin mainly uses HBase to store cube data, so the performance of the HBase cluster directly impacts Kylin's query performance. Commonly, HBase is deployed alongside MR/Hive on one HDFS cluster, which limits the resources available to HBase and lets MR jobs affect HBase performance. These problems can be resolved with a standalone HBase cluster, and Apache Kylin now supports this deployment mode.

Environment Requirements

To enable standalone HBase cluster support, check the basic environment first:

  • Deploy the main cluster and the HBase cluster, and make sure both work normally
  • Make sure the Kylin server can access both clusters using the hdfs shell with fully qualified paths
  • Make sure the Kylin server can submit MR jobs to the main cluster and can use the hive shell to access the data warehouse; the hadoop and hive configurations must point to the main cluster
  • Make sure the Kylin server can access the HBase cluster using the hbase shell; the hbase configuration must point to the HBase cluster
  • Make sure jobs on the main cluster can access the HBase cluster directly

Configurations

Set the config kylin.hbase.cluster.fs in kylin.properties to the Namenode address of the HBase cluster, like hdfs://hbase-cluster-nn01.example.com:8020

Note that the value must stay consistent with the Namenode address in root.dir on the HBase Master node, to ensure bulk load into HBase works.
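In kylin.properties this looks like the following (the hostname is a placeholder):

```
kylin.hbase.cluster.fs=hdfs://hbase-cluster-nn01.example.com:8020
```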

Using NN HA

HDFS Namenode HA improves cluster availability significantly, and the HBase cluster may have it enabled. Apache Kylin doesn't support this HA setup perfectly yet; here is the workaround:

  • Add all dfs.nameservices related configs of the HBase cluster into hadoop/etc/hadoop/hdfs-site.xml on the Kylin server, so that the HBase cluster can be accessed via the hdfs shell with a nameservice path
  • Add all dfs.nameservices related configs of both clusters into kylin_job_conf.xml, so that MR jobs can access the HBase cluster with a nameservice path
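As a sketch, the HBase cluster's HA entries in hdfs-site.xml typically look like the following (the `hbasecluster` nameservice ID and hostnames are placeholders):

```xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>hbasecluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.hbasecluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbasecluster.nn1</name>
    <value>hbase-nn01.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbasecluster.nn2</name>
    <value>hbase-nn02.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.hbasecluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```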

Troubleshooting

  • UnknownHostException occurs during cube building
    This usually occurs with an HBase HA nameservice config; please refer to the section “Using NN HA” above
  • HFile bulk loading gets stuck for a long time
    Check the regionserver log; there should be many errors with a WrongFS exception. Make sure the Namenode address in kylin.hbase.cluster.fs in kylin.properties and in root.dir in the HBase master's hbase-site.xml are the same

Diagnosis Tool Introduction


Introduction

Since Apache Kylin 1.5.2, there is a diagnosis tool on the Web UI, which aims to help Kylin admins extract diagnostic information for fault analysis and performance tuning.

Project Diagnosis

When users hit issues with query failures, slow queries, metadata management, and so on, they can click the ‘Diagnosis’ button on the System tab page.

A few seconds later, a diagnosis package becomes available to download from the web browser, containing Kylin logs, metadata, configuration, etc. Users can extract and analyze the package; also, when asking experts on the team for help, attaching the package improves communication efficiency.

Job Diagnosis

When users hit issues with jobs, such as failed or slow cubing, they can click the ‘Diagnosis’ button in the job's Action menu.

As with project diagnosis, a diagnosis package containing related logs, MR job info, metadata, etc. will be downloaded from the web browser. Users can use it for analysis or to ask for help.

Apache Kylin v1.5.3 Release Announcement


The Apache Kylin community is pleased to announce the release of Apache Kylin v1.5.3.

Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets; it was originally contributed by eBay Inc.

To download Apache Kylin v1.5.3 source code or binary package:
please visit the download page.

This is a major release that brings a more stable, robust, and manageable version; the Apache Kylin community resolved about 84 issues, including bug fixes, improvements, and a few new features.

Change Highlights

  • A better way to check hadoop job status KYLIN-1319
  • Global (and more scalable) dictionary KYLIN-1705
  • More stable and functional precise count distinct implements after KYLIN-1186 KYLIN-1379
  • Improve performance of MRv2 engine by making each mapper handles a configured number of records KYLIN-1656
  • Distribute source data by certain columns when creating flat table KYLIN-1677
  • Allow cube to override MR job configuration by properties KYLIN-1706
  • Allow non-admin users to edit the ‘Advanced Setting’ step in CubeDesigner KYLIN-1731
  • Calculate all 0 (except mandatory) cuboids KYLIN-1747
  • Allow mandatory only cuboid KYLIN-1749
  • Couldn’t use View as Lookup when join type is “inner” KYLIN-1789
  • Exception inside coprocessor should report back to the query thread KYLIN-1645
  • minimize dependencies of JDBC driver KYLIN-1846
  • TopN measure support non-dictionary encoding KYLIN-1478

Upgrade

Follow the upgrade guide.

Support

For any issue or question, please
open a JIRA ticket for the Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to the Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who contributed!


Use Count Distinct in Apache Kylin


Since v.1.5.3

Background

Count distinct is a common measure in OLAP analysis, often used for UV (unique visitors) and similar metrics. Apache Kylin offers two kinds of count distinct, approximate and precise, which differ in resource usage and performance.

Approximate Count Distinct

Apache Kylin implements approximate count distinct using the HyperLogLog algorithm, offering several precisions with error rates from 9.75% down to 1.22%.
The measure result has a theoretical upper size limit of 2^N bytes; for the maximum precision N=16, the upper limit is 64KB and the maximum error rate is 1.22%.
The pros of this implementation are fast calculation and storage savings, but it cannot be used when exact results are required.
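The quoted figures follow from the HyperLogLog error bound. A small sketch (assuming the ~3-sigma bound 3.12/sqrt(m) over m = 2^N registers, which reproduces the rates above):

```python
import math

def hll_stats(precision):
    """For HLL precision N there are m = 2**N registers, so the result
    size is bounded by m bytes and the ~3-sigma error bound is 3.12/sqrt(m)."""
    m = 2 ** precision
    return m, 3.12 / math.sqrt(m)

size, err = hll_stats(16)
print(size, round(err * 100, 2))          # 65536 bytes (64KB), 1.22% error
print(round(hll_stats(10)[1] * 100, 2))   # 9.75% error
```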

Precise Count Distinct

Apache Kylin also implements precise count distinct based on bitmaps. For data of type tinyint (byte), smallint (short), and int, the value is projected into the bitmap directly. For data of type long, string, and others, the value is encoded as a string into a dictionary, and the dictionary id is projected into the bitmap.
The measure result is the serialized bitmap, not just the count value. This makes sure the result is always right under any roll-up, even across segments.
The pros of this implementation are precise, error-free results, but it needs more storage; one result can be hundreds of MB when the count distinct value exceeds millions.
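The mechanism can be sketched in a few lines (plain Python sets stand in for the actual bitmaps; the names are illustrative):

```python
# Each segment stores a bitmap of encoded ids rather than a count,
# so roll-up is a bitmap union and the final cardinality stays exact.

def encode(dictionary, value):
    """Assign a stable id to each distinct value (dictionary encoding)."""
    if value not in dictionary:
        dictionary[value] = len(dictionary) + 1
    return dictionary[value]

dictionary = {}
seg1 = {encode(dictionary, v) for v in ["AAA", "BBB"]}  # segment 1 bitmap
seg2 = {encode(dictionary, v) for v in ["AAA", "CCC"]}  # segment 2 bitmap

# Aggregating across segments: union the bitmaps, then count.
print(len(seg1 | seg2))  # 3 distinct values, exact
```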

Global Dictionary

Apache Kylin encodes values into a dictionary at the segment level by default. That means the same value in different segments may be encoded to different ids, so the result of precise count distinct across segments may be incorrect.
To resolve this, we introduced the Global Dictionary, which guarantees that the same value is always encoded to the same id across segments. Meanwhile, dictionary capacity has expanded dramatically, supporting up to 2G values in one dict; it can also replace the default dictionary, which has a 5M value limitation.
The current version has no UI for the global dictionary yet, so the cube desc JSON should be modified to enable it:

"dictionaries": [
    {
        "column": "SUCPAY_USERID",
        "reuse": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]

`column` is the column to be encoded; `builder` specifies the dictionary builder, and only org.apache.kylin.dict.GlobalDictionaryBuilder is available for now.
`reuse` optimizes the dictionaries of multiple columns that draw from one dataset; please refer to the next section, ‘Example’, for details.
The global dictionary can't be used for dimension encoding for now: if a column serves as both a dimension and a count distinct measure in one cube, the dimension must use an encoding other than dict.

Example

Here’s some example data:
| DT | USER_ID | FLAG1 | FLAG2 | USER_ID_FLAG1 | USER_ID_FLAG2 |
| :--------: | :-----: | :---: | :---: | :-----------: | :-----------: |
| 2016-06-08 | AAA | 1 | 1 | AAA | AAA |
| 2016-06-08 | BBB | 1 | 1 | BBB | BBB |
| 2016-06-08 | CCC | 0 | 1 | NULL | CCC |
| 2016-06-09 | AAA | 0 | 1 | NULL | AAA |
| 2016-06-09 | CCC | 1 | 0 | CCC | NULL |
| 2016-06-10 | BBB | 0 | 1 | NULL | BBB |

There are basic columns DT, USER_ID, FLAG1, FLAG2, and condition columns USER_ID_FLAG1=if(FLAG1=1,USER_ID,null), USER_ID_FLAG2=if(FLAG2=1,USER_ID,null). Suppose the cube is built by day and has 3 segments.

Without the global dictionary, precise count distinct within a segment is correct, but the rolled-up result across segments is wrong. Here's an example:

select count(distinct user_id_flag1) from table where dt in ('2016-06-08', '2016-06-09')

The result is 2 rather than 3. The reason is that the dict in the 2016-06-08 segment is AAA=>1, BBB=>2, while the dict in the 2016-06-09 segment is CCC=>1.
With the global dictionary config below, the dict becomes AAA=>1, BBB=>2, CCC=>3, which produces the correct result:

"dictionaries": [
    {
        "column": "USER_ID_FLAG1",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]

Actually, the data of USER_ID_FLAG1 and USER_ID_FLAG2 are both subsets of the USER_ID dataset, which makes dictionary re-use possible: encode the USER_ID dataset once, and configure USER_ID_FLAG1 and USER_ID_FLAG2 to reuse the USER_ID dict:

"dictionaries": [
    {
        "column": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    },
    {
        "column": "USER_ID_FLAG1",
        "reuse": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    },
    {
        "column": "USER_ID_FLAG2",
        "reuse": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]

Performance Tuning

When using a global dictionary, and the dictionary is large, the step ‘Build Base Cuboid Data’ may take a long time. That is mainly caused by dictionary cache loading and eviction, since the dictionary is bigger than the mapper's memory. To solve this, override the cube configuration as follows to raise the mapper size to 8GB:

kylin.job.mr.config.override.mapred.map.child.java.opts=-Xmx8g
kylin.job.mr.config.override.mapreduce.map.memory.mb=8500

Conclusions

Here are some basic principles for deciding which kind of count distinct to use:
- If a result with some error is acceptable, the approximate way is almost always the better choice
- If you need a precise result, the only way is precise count distinct
- If you don't need roll-up across segments, or the column type is tinyint/smallint/int, or the distinct value count is below 5M, just use the default dictionary; otherwise configure the global dictionary, and consider the reuse-column optimization


Query Metrics in Apache Kylin


Apache Kylin supports query metrics since 1.5.4. This blog introduces why Kylin needs query metrics, their concrete contents and meaning, their daily uses, and how to collect them.

Background

When Kylin becomes an enterprise application, you must ensure the query service is highly available and performant; besides, you need to commit to an SLA for the query service, which requires Kylin to support query metrics.

Introduction

Query metrics are exposed at three levels: Server, Project, and Cube.

For example, QueryCount has three kinds of metrics:

```
Hadoop:name=Server_Total,service=Kylin.QueryCount
Hadoop:name=learn_kylin,service=Kylin.QueryCount
Hadoop:name=learn_kylin,service=Kylin,sub=kylin_sales_cube.QueryCount
```

Server_Total represents a query server node, learn_kylin is a project name, and kylin_sales_cube is a cube name.
The Key Query Metrics

  • QueryCount: the total number of queries.
  • QueryFailCount: the total number of failed queries.
  • QuerySuccessCount: the total number of successful queries.
  • CacheHitCount: the total number of query cache hits.
  • QueryLatency60s99thPercentile: the 99th percentile of query latency over a 60s window. (There are five percentiles: 99th, 95th, 90th, 75th, 50th, and three time intervals: 60s, 360s, 3600s. The intervals can be set by kylin.query.metrics.percentiles.intervals, whose default value is 60, 360, 3600.)
  • QueryLatencyAvgTime, QueryLatencyMaxTime, QueryLatencyMinTime: the average, max, and min query latency.
  • ScanRowCount: the number of rows scanned from HBase; broken down like QueryLatency.
  • ResultRowCount: the number of rows in query results; broken down like QueryLatency.

Daily Function

Besides backing a query-service SLA for users, in daily operation and maintenance you can build a Kylin query daily report and a query dashboard from these metrics, which helps you understand query patterns and performance and analyze query incidents.

How To Use

First, set the config kylin.query.metrics.enabled to true to collect query metrics to JMX.

Second, you can use any JMX collection tool to ship the query metrics to your monitoring system. Note that the metrics exist at the Server, Project, and Cube levels, implemented via dynamic ObjectNames, so you should match ObjectNames with a regular expression.
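Since the ObjectNames are dynamic, one way to pick out the QueryCount beans per server, project, or cube is a regular expression over the bean names (a sketch; the patterns follow the examples above):

```python
import re

# Matches e.g. "Hadoop:name=learn_kylin,service=Kylin,sub=kylin_sales_cube.QueryCount"
PATTERN = re.compile(
    r"Hadoop:name=(?P<name>[^,]+),service=Kylin(?:,sub=(?P<cube>[^.]+))?\.QueryCount"
)

beans = [
    "Hadoop:name=Server_Total,service=Kylin.QueryCount",
    "Hadoop:name=learn_kylin,service=Kylin.QueryCount",
    "Hadoop:name=learn_kylin,service=Kylin,sub=kylin_sales_cube.QueryCount",
]
for bean in beans:
    m = PATTERN.match(bean)
    print(m.group("name"), m.group("cube"))  # cube is None for server/project beans
```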

New NRT Streaming in Apache Kylin


In 1.5.0 Apache Kylin introduced the Streaming Cubing feature, which can consume data from a Kafka topic directly. This blog introduces how it was implemented, and this tutorial introduces how to use it.

However, that implementation was marked as “experimental” because it has the following limitations:

  • Not scalable: it starts a Java process for a micro-batch cube build instead of leveraging a computing framework; if too many messages arrive at once, the build may fail with an OutOfMemory error;

  • May lose data: it uses a time window to seek the approximate start/end offsets on the Kafka topic, so messages arriving too late or too early are skipped, and queries can't guarantee 100% accuracy.

  • Difficult to monitor: the streaming cubing is outside the job engine's scope, so users cannot monitor the jobs via the Web GUI or REST API.

  • Others: hard to recover from failures, difficult to maintain the code, etc.

To overcome these limitations, the Apache Kylin team developed a new streaming design (KYLIN-1726) on Kafka 0.10; it has been tested internally for some time and will be released to the public soon.

The new design is a natural fit for Kylin 1.5's “plug-in” architecture: treat the Kafka topic as a “Data Source” like a Hive table, using an adapter to extract the data to HDFS; the following steps are almost the same as for other cubes. Figure 1 shows a high-level architecture of the new design.

Kylin New Streaming Framework Architecture

The adapter that reads Kafka messages is modified from kafka-hadoop-loader, which its author Michal Harish open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS. Kylin can thus leverage an existing framework like MR for the processing, which makes the solution scalable and fault-tolerant.

To overcome the “data loss” limitation, Kylin records the start/end offsets on each cube segment and uses the offsets as the partition value (no overlap allowed); this ensures no data is lost and each message is consumed at most once. So that late/early messages can still be queried, cube segments are allowed to overlap on the partition time dimension: each segment has a “min” date/time and a “max” date/time, and Kylin scans all segments that intersect the queried time scope. Figure 2 illustrates this.

Use Offset to Cut Segments
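The query-side matching can be sketched like this (illustrative names and values; each segment carries exact, non-overlapping offsets plus a min/max event time that may overlap with its neighbors):

```python
# (start_offset, end_offset, min_time, max_time) per segment;
# offsets never overlap, but event-time ranges may.
segments = [
    (0,    1000, "2016-10-14 09:58", "2016-10-14 10:05"),
    (1000, 2300, "2016-10-14 10:03", "2016-10-14 10:12"),
    (2300, 3100, "2016-10-14 10:11", "2016-10-14 10:20"),
]

def segments_for(query_start, query_end):
    """All segments whose event-time range intersects the queried scope."""
    return [s for s in segments
            if s[2] <= query_end and s[3] >= query_start]

hits = segments_for("2016-10-14 10:04", "2016-10-14 10:10")
print(len(hits))  # the first two segments overlap this scope
```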

Other changes/enhancements are made in the new streaming:

  • Allow multiple segments being built/merged concurrently
  • Automatically seek start/end offsets (if user doesn’t specify) from previous segment or Kafka
  • Support embedded properties in JSON messages
  • Add REST API to trigger streaming cube’s building
  • Add REST API to check and fill the segment holes

The integration test result is promising:

  • Scalability: it can easily process up to hundreds of millions of records in one build;
  • Flexibility: you can trigger the build at any time, with whatever frequency you want; for example, every 5 minutes in the daytime but every hour at night, or even pause when you need to do maintenance; Kylin manages the offsets so it can automatically continue from the last position;
  • Stability: pretty stable, no OutOfMemoryError;
  • Management: users can check all jobs' status through Kylin's “Monitor” page or REST API;
  • Build performance: in a test cluster (8 AWS instances consuming Twitter streams at about 10 thousand messages per second, with a 9-dimension cube with 3 measures), a 2-minute build interval finishes in around 3 minutes, and a 5-minute interval in around 4 minutes;

Here are a couple of screenshots from this test; we may compose them into a step-by-step tutorial in the future:
Streaming Job Monitoring

Streaming Adapter

Streaming Twitter Sample

In short, this is a more robust Near Real Time streaming OLAP solution compared with the previous version. Next, the Apache Kylin team will move toward a real-time engine.

Use Window Function and Grouping Sets in Apache Kylin


Since v.1.5.4

Background

Apache Kylin provides window functions and grouping sets to support more complicated queries while keeping SQL statements simple and clear. This article shows how to use them.

Window Function

Window functions give us the ability to partition, order, and aggregate over the query result set, letting us meet complicated query requirements with much simpler SQL statements.
The window function syntax of Apache Kylin can be found in the calcite reference and is similar to Hive window functions.
Here’s some examples:

sum(col) over()
count(col) over(partition by col2)
row_number() over(partition by col order by col2)
first_value(col) over(partition by col2 order by col3 rows 2 preceding)
lag(col, 5, 0) over(partition by col2 order by col3 range between 3 preceding and 6 following)

Grouping Sets

Sometimes we want to aggregate data by different keys in one SQL statement. This is possible with the grouping sets feature.
Here's an example; suppose we execute this query and get the result below:

select dim1, dim2, sum(col) as metric1 from table group by dim1, dim2

| dim1 | dim2 | metric1 |
| :--: | :--: | :-----: |
| A | AA | 10 |
| A | BB | 20 |
| B | AA | 15 |
| B | BB | 30 |

If we also want to see the result with dim2 rolled up, we can rewrite the SQL and get the result below:

select dim1,
case grouping(dim2) when 1 then 'ALL' else dim2 end,
sum(col) as metric1
from table
group by grouping sets((dim1, dim2), (dim1))
dim1 | dim2 | metric1
-----|------|--------
A    | ALL  | 30
A    | AA   | 10
A    | BB   | 20
B    | ALL  | 45
B    | AA   | 15
B    | BB   | 30

Apache Kylin supports cube/rollup/grouping sets; the grouping functions can be found here. This is also similar to Hive's grouping sets.
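The rolled-up totals above can be double-checked with a plain Python sketch of the grouping-sets aggregation (same sample data; illustrative only, not how Kylin executes it):

```python
from collections import defaultdict

rows = [("A", "AA", 10), ("A", "BB", 20), ("B", "AA", 15), ("B", "BB", 30)]

def grouping_sets(rows, key_sets):
    """Aggregate sum(metric) once per grouping set, like GROUP BY GROUPING SETS."""
    result = []
    for keys in key_sets:
        acc = defaultdict(int)
        for dim1, dim2, metric in rows:
            row = {"dim1": dim1, "dim2": dim2}
            acc[tuple(row[k] for k in keys)] += metric
        for group, total in acc.items():
            result.append((keys, group, total))
    return result

# group by grouping sets ((dim1, dim2), (dim1))
out = grouping_sets(rows, [("dim1", "dim2"), ("dim1",)])
# The (dim1,) set yields the rolled-up rows: A -> 30, B -> 45.
```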

Retention and Conversion Rate Analysis in Apache Kylin


Since v1.6.0

Background

Retention and conversion rates are important in data analysis. In general, the value can be calculated from the intersection of two data sets (of uuids, etc.) that share some dimensions (city, category, etc.) and differ in one varying dimension (date, etc.).
Apache Kylin supports retention calculation based on Bitmap and the UDAF intersect_count. This article introduces how to use this feature.

Usage

To use retention calculation in Apache Kylin, the following requirements must be met:
* Only one dimension can be the varying dimension
* The column to be calculated must have a precise count distinct measure defined

The intersect_count usage is described below:

intersect_count(columnToCount, columnToFilter, filterValueList)
`columnToCount` the column to calculate the distinct count on
`columnToFilter` the varying dimension
`filterValueList` the values of the varying dimension, given as an array

Here are some examples:

intersect_count(uuid, dt, array['20161014', '20161015'])
The precise distinct count of uuids that show up in both 20161014 and 20161015

intersect_count(uuid, dt, array['20161014', '20161015', '20161016'])
The precise distinct count of uuids that show up in all of 20161014, 20161015 and 20161016

intersect_count(uuid, dt, array['20161014'])
The precise distinct count of uuids that show up in 20161014, equivalent to `count(distinct uuid)`

A complete SQL statement example:

select city, version,
intersect_count(uuid, dt, array['20161014']) as first_day,
intersect_count(uuid, dt, array['20161015']) as second_day,
intersect_count(uuid, dt, array['20161016']) as third_day,
intersect_count(uuid, dt, array['20161014', '20161015']) as retention_oneday,
intersect_count(uuid, dt, array['20161014', '20161015', '20161016']) as retention_twoday
from visit_log
where dt in ('20161014', '20161015', '20161016')
group by city, version
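Conceptually, intersect_count over Bitmaps is a set intersection. The retention numbers from a query like the one above can be modeled in Python with toy data (the uuids below are made up for illustration):

```python
# uuids seen per day (toy data; in Kylin these are precise-count-distinct bitmaps)
visits = {
    "20161014": {"u1", "u2", "u3"},
    "20161015": {"u2", "u3", "u4"},
    "20161016": {"u3", "u5"},
}

def intersect_count(column_by_filter, filter_values):
    """Mimic intersect_count(uuid, dt, array[...]): size of the intersection."""
    sets = [column_by_filter[v] for v in filter_values]
    return len(set.intersection(*sets))

first_day = intersect_count(visits, ["20161014"])                          # count(distinct uuid) on day 1
retention_oneday = intersect_count(visits, ["20161014", "20161015"])       # u2, u3 retained
retention_twoday = intersect_count(visits, ["20161014", "20161015", "20161016"])  # only u3 retained
```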

Conclusions

Based on Bitmap and the UDAF intersect_count, we can do fast and convenient retention analysis in Apache Kylin. Compared with the traditional way, the SQL in Apache Kylin is much simpler, clearer, and more efficient.

Apache Kylin v1.6.0 Release Announcement


The Apache Kylin community is pleased to announce the release of Apache Kylin v1.6.0.

Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.

This is a major release after 1.5.4, with reliable and scalable support for using Apache Kafka as a data source; this enables users to build cubes directly from streaming data (without loading it into Apache Hive), reducing data latency from days/hours to minutes.

Apache Kylin 1.6.0 resolved 102 issues including bug fixes, improvements, and new features. All of the changes can be found in the release notes.

Change Highlights

  • Scalable streaming cubing KYLIN-1726
  • TopN counter merge performance improvement KYLIN-1917
  • Support Embedded Structure JSON Message KYLIN-1919
  • More robust approach to hive schema changes KYLIN-2012
  • TimedJsonStreamParser should support other time format KYLIN-2054
  • Add an encoder for Boolean type KYLIN-2055
  • Allow concurrent build/refresh/merge KYLIN-2070
  • Support to change streaming configuration KYLIN-2082

To download Apache Kylin v1.6.0 source code or binary package, visit the download page.

Upgrade

Follow the upgrade guide.

Support

For any issue or question, please
open a JIRA in the Apache Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who contributed!

Apache Kylin v1.6.0 Officially Released


The Apache Kylin community is pleased to announce the official release of Apache Kylin v1.6.0.

Apache Kylin is an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on top of Hadoop, supporting second-level queries over extremely large datasets.

Apache Kylin v1.6.0 brings a more reliable and easier-to-manage capability to build Cubes directly from Apache Kafka streams, letting users analyze data more naturally in more scenarios and reducing the latency from data generation to data being queryable from a day or several hours down to minutes. Apache Kylin 1.6.0 resolved 102 issues including bug fixes, improvements, and new features; see the release notes for details.

Change Highlights

To download the Apache Kylin v1.6.0 source code or binary package, please visit the download page.

Upgrade

See the upgrade guide.

Support

For any issue or question during upgrade or use, please
open a JIRA in the Apache Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to the Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who participated and contributed!

By-layer Spark Cubing

$
0
0

Before v2.0, Apache Kylin used Hadoop MapReduce as the framework to build Cubes over huge datasets. The MapReduce framework is simple and stable and fulfills Kylin's needs very well, except for performance. To get better performance, we introduced the "fast cubing" algorithm in Kylin v1.5, which tries to do as many aggregations as possible in memory at the map side, so as to avoid disk and network I/O; but not all data models can benefit from it, and it still runs on MR, which means on-disk sorting and shuffling.

Now Spark comes along; Apache Spark is an open-source cluster-computing framework which provides programmers with an application programming interface centered on a data structure called the RDD. It runs in memory on the cluster, which makes repeated access to the same data much faster. Spark provides flexible and fancy APIs, and you are not tied to Hadoop's MapReduce two-stage paradigm.

Before introducing how to calculate a Cube with Spark, let's see how Kylin does it with MR. Figure 1 illustrates how a 4-dimension Cube gets calculated with the classic "by-layer" algorithm: the first round of MR aggregates the base (4-D) cuboid from source data; the second round aggregates on the base cuboid to get the 3-D cuboids; with N+1 rounds of MR, all layers' cuboids get calculated.

MapReduce Cubing by Layer

The "by-layer" cubing divides a big task into several steps, and each step builds on the previous step's output, so it can reuse the previous calculation and avoid recalculating from the very beginning when there is a failure in between. This makes it a reliable algorithm. When moving to Spark, we decided to keep this algorithm; that's why we call this feature "By-layer Spark Cubing".

As we know, the RDD (Resilient Distributed Dataset) is a basic concept in Spark. A collection of N-dimension cuboids can be well described as an RDD, so an N-dimension Cube will have N+1 RDDs. These RDDs have a parent/child relationship, as the parent can be used to generate the children. With the parent RDD cached in memory, generating the child RDD can be much more efficient than reading from disk. Figure 2 describes this process.
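Ignoring RDD mechanics and distribution, the by-layer step of aggregating each child cuboid from its parent can be sketched in plain Python (the dimension names and values here are illustrative):

```python
from collections import defaultdict

def aggregate_layer(parent, child_dims_list):
    """Build each child cuboid by summing the parent's measure
    over the dropped dimensions (one by-layer step)."""
    children = {}
    for dims in child_dims_list:
        acc = defaultdict(int)
        for key, measure in parent["rows"].items():
            # keep only the values of the surviving dimensions
            kept = tuple(v for d, v in zip(parent["dims"], key) if d in dims)
            acc[kept] += measure
        children[dims] = {"dims": dims, "rows": dict(acc)}
    return children

# base (2-D) cuboid: (country, city) -> sum(sales)
base = {"dims": ("country", "city"),
        "rows": {("CN", "SH"): 10, ("CN", "BJ"): 5, ("US", "NY"): 7}}

# one by-layer step: the 2-D cuboid generates both 1-D cuboids
layer1 = aggregate_layer(base, [("country",), ("city",)])
```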

Spark Cubing by Layer

Figure 3 is the DAG of cubing in Spark; it illustrates the process in detail: in "Stage 5", Kylin uses a HiveContext to read the intermediate Hive table and then does a "map" operation, a one-to-one map, to encode the original values into K-V bytes. On completion, Kylin gets an intermediate encoded RDD. In "Stage 6", the intermediate RDD is aggregated with a "reduceByKey" operation to get RDD-1, which is the base cuboid. Next, a "flatMap" (one-to-many map) is done on RDD-1, because the base cuboid has N child cuboids. And so on, all levels' RDDs get calculated. These RDDs are persisted to the distributed file system on completion, but kept cached in memory for the next level's calculation. Once the children have been generated, the parent is removed from the cache.

DAG of Spark Cubing

We did a test to see how much performance improvement can gain from Spark:

Environment

  • 4 nodes Hadoop cluster; each node has 28 GB RAM and 12 cores;
  • YARN has 48GB RAM and 30 cores in total;
  • CDH 5.8, Apache Kylin 2.0 beta.

Spark

  • Spark 1.6.3 on YARN
  • 6 executors, each has 4 cores, 4GB +1GB (overhead) memory

Test Data

  • Airline data, total 160 million rows
  • Cube: 10 dimensions, 5 measures (SUM)

Test Scenarios

  • Build the cube at different source data scales: 3 million, 50 million and 160 million source rows; compare the build time between MapReduce (by layer) and Spark. No compression enabled.
    The time only covers the cube building step, not including data preparation and subsequent steps.

Spark vs MR performance

Spark is faster than MR in all 3 scenarios, and overall it can cut the cubing time roughly in half.

Now you can download a 2.0.0 beta build from Kylin's download page and follow this post to build a cube with the Spark engine. If you have any comments or input, please discuss in the community.


Apache Kylin v2.0.0 Beta Released


The Apache Kylin community is very pleased to announce that the v2.0.0 beta package is now available for download and testing.

More than two months have passed since the v1.6.0 release. During this time, the whole community has worked together to complete a series of major features, hoping to take Apache Kylin to a new level.

You are very welcome to download and test v2.0.0 beta. Your feedback is very important to us; please send mail to dev@kylin.apache.org


Install

For now, v2.0.0 beta cannot be upgraded from v1.6.0 directly; a fresh install is required. This is because the new version's metadata is not backward compatible. Fortunately, the Cube data is backward compatible, so only a metadata conversion tool needs to be developed to enable a smooth upgrade in the near future. We are working on this.


Run the TPC-H Benchmark

Detailed steps to run TPC-H on Apache Kylin: https://github.com/Kyligence/kylin-tpch


Spark Build Engine

Apache Kylin v2.0.0 introduces a brand-new build engine based on Apache Spark. It can replace the original MapReduce build engine. Initial tests show that Cube build time can generally be reduced to about 50% of before.

To enable the Spark build engine:

  • Carefully review the Spark Engine Configs section in kylin.properties.
    • Make sure the specified HADOOP_CONF_DIR contains the site xmls for core, yarn, hive, and hbase.
    • Adjust the Spark executor instances, cores, and memory to your cluster's configuration.
    • Hive on Tez ran into problems in our tests; switching to Hive on MR works around them.
  • When creating a new Cube, just select the Spark build engine in the "Advanced" settings.

Great thanks to everyone who participated and contributed!

Apache Kylin v2.0.0 Beta Announcement


The Apache Kylin community is pleased to announce the v2.0.0 beta package is ready for download and test.

It has been more than 2 months since the v1.6.0 release. The community has been working hard to deliver some long-wanted features, hoping to move Apache Kylin to the next level.

You are very welcome to give the v2.0.0 beta a try, and please do send feedback to dev@kylin.apache.org.


Install

The v2.0.0 beta requires a fresh install at the moment; it cannot be upgraded from v1.6.0 due to incompatible metadata. However, the underlying cube data is backward compatible. We are working on an upgrade tool to transform the metadata, so that a smooth upgrade will be possible.


Run TPC-H Benchmark

Steps to run TPC-H benchmark on Apache Kylin can be found here: https://github.com/Kyligence/kylin-tpch


Spark Cubing Engine

Apache Kylin v2.0.0 introduces a new cubing engine based on Apache Spark that can be selected to replace the original MR engine. Initial tests showed that the Spark engine could cut the build time to 50% in most cases.

To enable the Spark cubing engine:

  • Go through the “Spark Engine Configs” settings in kylin.properties carefully.
    • Make sure the HADOOP_CONF_DIR contains site xmls of core, yarn, hive, and hbase.
    • Adjust the numbers of spark executor instances, cores, and memory according to your environment.
    • Hive on Tez somehow did not work in our tests. Switching to Hive on MR worked.
  • When creating a new cube, select the spark engine in the advanced settings tab. And that is it!

Great thanks to everyone who contributed!

A new measure for Percentile precalculation


Introduction

Since Apache Kylin 2.0, there is a new measure for percentile precalculation, which aims at (sub-)second latency for approximate percentile analytics in SQL queries. The implementation is based on the t-digest library under the Apache 2.0 license, which provides a highly efficient data structure to store aggregation counters and an algorithm to calculate the approximate percentile result.

Percentile

From wikipedia: A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.

In Apache Kylin, we support similar SQL syntax to Apache Hive, with an aggregation function called percentile(<Number Column>, <Double>):

SELECT seller_id, percentile(price, 0.5) FROM test_kylin_fact GROUP BY seller_id

How to use

If you know little about Cubes, please go to the QuickStart first to learn the basics.

Firstly, you need to add this column as measure in data model.

Secondly, create a cube and add a PERCENTILE measure.

Finally, build the cube and try some queries.
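As a sanity check on what percentile(price, 0.5) computes per group, here is an exact Python version using linear interpolation; Kylin's measure uses t-digest, so its results are approximate, and the sample rows below are made up:

```python
from collections import defaultdict

def percentile(values, p):
    """Exact percentile by linear interpolation (t-digest approximates this)."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    rank = p * (len(xs) - 1)
    lo, frac = int(rank), rank - int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * frac

# toy (seller_id, price) rows standing in for test_kylin_fact
rows = [("s1", 10.0), ("s1", 20.0), ("s1", 40.0), ("s2", 5.0)]
by_seller = defaultdict(list)
for seller_id, price in rows:
    by_seller[seller_id].append(price)

# equivalent of: SELECT seller_id, percentile(price, 0.5) ... GROUP BY seller_id
medians = {s: percentile(v, 0.5) for s, v in by_seller.items()}
# s1's median is 20.0, s2's is 5.0
```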

Improving Spark Cubing in Kylin 2.0


Apache Kylin is an OLAP engine that speeds up queries via Cube precomputation. A Cube is a multi-dimensional dataset that contains all measures precomputed for all dimension combinations. Before v2.0, Kylin used MapReduce to build Cubes. To get better performance, Kylin 2.0 introduced Spark Cubing. For the principle behind Spark Cubing, please refer to the article By-layer Spark Cubing.

In this blog, I will talk about the following topics:

  • How to make Spark Cubing support HBase cluster with Kerberos enabled
  • Spark configurations for Cubing
  • Performance of Spark Cubing
  • Pros and cons of Spark Cubing
  • Applicable scenarios of Spark Cubing
  • Improvement for dictionary loading in Spark Cubing

The current Spark Cubing (2.0) version doesn't support HBase clusters with Kerberos enabled, because Spark Cubing needs to get metadata from HBase. To solve this problem, there are two options: one is to make Spark connect to HBase with Kerberos, the other is to avoid Spark connecting to HBase during Spark Cubing.

Make Spark connect HBase with Kerberos enabled

If we just want to run Spark Cubing in YARN client mode, we only need to add a few lines of code before new SparkConf() in SparkCubingByLayer:

Configuration configuration = HBaseConnection.getCurrentHBaseConfiguration();
HConnection connection = HConnectionManager.createConnection(configuration);
// Obtain an authentication token for the given user and add it to the user's credentials.
TokenUtil.obtainAndCacheToken(connection, UserProvider.instantiate(configuration).create(UserGroupInformation.getCurrentUser()));

As for how to make Spark connect to HBase using Kerberos in YARN cluster mode, please refer to SPARK-6918, SPARK-12279, and HBASE-17040. The solution may work, but it is not elegant, so I tried the second option.

Use HDFS metastore for Spark Cubing

The core idea here is to upload the metadata necessary for the job to HDFS and use HDFSResourceStore to manage it.

Before introducing how to use HDFSResourceStore instead of HBaseResourceStore in Spark Cubing, let's see what Kylin's metadata format is and how Kylin manages the metadata.

Every concrete piece of metadata for a table, cube, model, or project is a JSON file in Kylin. The whole metadata set is organized as a file directory. The picture below shows the root directory of Kylin metadata:
(screenshot: the root directory of Kylin metadata)
The following picture shows the content of the project dir; "learn_kylin" and "kylin_test" are both project names.
(screenshot: content of the project directory)

Kylin manages the metadata through ResourceStore, an abstract class that defines the CRUD interface for metadata. ResourceStore has three implementation classes:

  • FileResourceStore (stores on the local file system)
  • HDFSResourceStore
  • HBaseResourceStore

Currently, only HBaseResourceStore can be used in a production environment. FileResourceStore is mainly used for testing. HDFSResourceStore doesn't support massive concurrent writes, but it is ideal for read-only scenarios like cubing. Kylin uses the "kylin.metadata.url" config to decide which kind of ResourceStore will be used.
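That selection can be pictured with a toy Python dispatch; the "@hbase"/"@hdfs" suffix convention and the logic below are simplified assumptions for illustration, not Kylin's actual Java implementation:

```python
def choose_resource_store(metadata_url):
    """Toy dispatch mirroring Kylin's three ResourceStore implementations.
    The URL suffix convention here is a simplification for illustration."""
    if metadata_url.endswith("@hbase"):
        return "HBaseResourceStore"   # the production metastore
    if metadata_url.endswith("@hdfs"):
        return "HDFSResourceStore"    # read-only scenarios such as Spark Cubing
    return "FileResourceStore"        # local file system, mainly for testing

store = choose_resource_store("kylin_metadata@hbase")
```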

Now, let's see how to use HDFSResourceStore instead of HBaseResourceStore in Spark Cubing:

  1. Determine the metadata necessary for the Spark Cubing job
  2. Dump the necessary metadata from HBase to the local file system
  3. Update kylin.metadata.url, then write all Kylin config to a "kylin.properties" file in the local metadata dir
  4. Use ResourceTool to upload the local metadata to HDFS
  5. Construct the HDFSResourceStore from the HDFS "kylin.properties" file in the Spark executor

Of course, we need to delete the HDFS metadata dir on completion. I'm working on a patch for this; please watch KYLIN-2653 for updates.

Spark configurations for Cubing

The following is the Spark configuration I used in our environment. It enables Spark dynamic resource allocation; the goal is to let our users set fewer Spark configurations.

# running in yarn-cluster mode
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster

# enable dynamic allocation so users don't have to set the number of executors explicitly
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=10
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1024
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.shuffle.service.port=7337

# memory config; enlarge executor.memory when the cube dictionary is huge,
# because Kylin needs to load the cube dictionary in the executor
kylin.engine.spark-conf.spark.driver.memory=4G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.cores=1

# enlarge the timeout
kylin.engine.spark-conf.spark.network.timeout=600

kylin.engine.spark-conf.spark.yarn.queue=root.hadoop.test

kylin.engine.spark.rdd-partition-cut-mb=100

Performance test of Spark Cubing

For source data scales from millions to hundreds of millions of rows, my test results are consistent with the blog By-layer Spark Cubing. The improvement is remarkable. Moreover, I also specifically tested with billions of rows of source data and a huge dictionary.

The test Cube1 has 2.7 billion rows of source data, 9 dimensions, and one precise distinct count measure with 70 million cardinality (which means the dict also has 70 million cardinality).

The test Cube2 has 2.4 billion rows of source data, 13 dimensions, and 38 measures (including 9 precise distinct count measures).

The test result is shown in the chart below; the unit of time is minutes.
(chart: build time comparison between Spark and MR cubing)

In one word, Spark Cubing is much faster than MR cubing in most scenarios.

Pros and Cons of Spark Cubing

In my opinion, the advantages of Spark Cubing include:

  1. Because of the RDD cache, Spark Cubing could take full advantage of memory to avoid disk I/O.
  2. When we have enough memory resource, Spark Cubing could use more memory resource to get better build performance.

On the other hand, the drawbacks of Spark Cubing include:

  1. Spark Cubing can't handle huge dictionaries (hundreds of millions of cardinality) well;
  2. Spark Cubing isn’t stable enough for very large scale data.

Applicable scenarios of Spark Cubing

In my opinion, except for the huge dictionary scenario, we could use Spark Cubing to replace MR cubing, especially in the following scenarios:

  1. Many dimensions
  2. Normal dictionaries (e.g., cardinality < 100 million)
  3. Normal scale data (e.g., less than 10 billion rows to build at once).

Improvement for dictionary loading in Spark Cubing

As we all know, a big difference between MR and Spark is that an MR task runs in its own process, while Spark tasks run as threads. So in MR cubing, the cube dictionary is loaded only once per task, but in Spark Cubing the dictionary would be loaded many times in one executor, which causes frequent GC.

So, I made two improvements:

  1. Only load the dict once in one executor.
  2. Add a maximumSize to the LoadingCache in AppendTrieDictionary so the dictionary can be evicted as early as possible.

These two improvements have been contributed to the Kylin repository.
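The second improvement can be pictured as a size-bounded LRU cache; this Python sketch only mirrors the idea (Kylin's actual code uses Guava's LoadingCache with maximumSize), and the loader and names here are hypothetical:

```python
from collections import OrderedDict

class BoundedDictCache:
    """Size-bounded LRU cache: evicts the least-recently-used dictionary
    once maximum_size is exceeded, like Guava's LoadingCache(maximumSize)."""
    def __init__(self, loader, maximum_size):
        self.loader = loader
        self.maximum_size = maximum_size
        self.cache = OrderedDict()
        self.loads = 0   # how many times the loader actually ran

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        self.loads += 1
        value = self.loader(key)
        self.cache[key] = value
        if len(self.cache) > self.maximum_size:
            self.cache.popitem(last=False)  # evict the LRU entry early
        return value

cache = BoundedDictCache(loader=lambda k: f"dict-for-{k}", maximum_size=2)
cache.get("colA"); cache.get("colA")  # second call hits the cache: only one load
cache.get("colB"); cache.get("colC")  # colA is evicted once the bound is exceeded
```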

Summary

Spark Cubing is a great feature for Kylin 2.0; thanks to the Kylin community. We will apply Spark Cubing to real scenarios in our company. I believe Spark Cubing will become more robust and efficient in future releases.

Get Your Interactive Analytics Superpower, with Apache Kylin and Apache Superset


Challenge of Big Data

In the big data era, all enterprises face the growing demand and challenge of processing large volumes of data, workloads that traditional legacy systems can no longer satisfy. With the emergence of Artificial Intelligence (AI) and Internet-of-Things (IoT) technology, it has become mission-critical for businesses to accelerate their pace of discovering valuable insights from their massive and ever-growing datasets. Thus, large companies are constantly searching for a solution, often turning to open source technologies. We will introduce two open source technologies that, when combined, can meet these pressing big data demands for large enterprises.

Apache Kylin: a Leading Open Source OLAP-on-Hadoop Engine

Modern organizations have had a long history of applying Online Analytical Processing (OLAP) technology to analyze data and uncover business insights. These insights help businesses make informed decisions and improve their service and product. With the emergence of the Hadoop ecosystem, OLAP has also embraced new technologies in the big data era.

Apache Kylin is one such technology that directly addresses the challenge of conducting analytical workloads on massive datasets. It is already widely adopted by enterprises around the world. With powerful pre-calculation technology, Apache Kylin enables sub-second query latency over petabyte-scale datasets. The innovative and intricate design of Apache Kylin allows it to seamlessly consume data from any Hadoop-based data source, as well as other relational database management systems (RDBMS). Analysts can query Apache Kylin with standard SQL through ODBC, JDBC, and a RESTful API, which enables the platform to integrate with any third-party application.

Figure 1: Apache Kylin Architecture

In a fast-paced and rapidly-changing business environment, business users and analysts are expected to uncover insights at the speed of thought. They can meet this expectation with Apache Kylin, and are no longer subjected to the predicament of waiting hours for a single query to return results. Such a powerful data processing engine empowers the data scientists, engineers, and business analysts of any enterprise to find insights that help reach critical business decisions. However, business decisions cannot be made without rich data visualization. To address this last-mile challenge of big data analytics, Apache Superset comes into the picture.

Apache Superset: Modern, Enterprise-ready Business Intelligence Platform

Apache Superset is a data exploration and visualization platform designed to be visual, intuitive, and interactive. A user can access data in the following two ways:

  1. Access data from the following commonly used data sources one table at a time: Kylin, Presto, Hive, Impala, SparkSQL, MySQL, Postgres, Oracle, Redshift, SQL Server, Druid.

  2. Use a rich SQL Interactive Development Environment (IDE) called SQL Lab that is designed for power users with the ability to write SQL queries to analyze multiple tables.

Users can immediately analyze and visualize their query results using Apache Superset's rich visualization and reporting features.


Figure 2


Figure 3: Apache Superset Visualization Interface

Integrating Apache Kylin and Apache Superset to Boost Your Productivity

Both Apache Kylin and Apache Superset are built to provide fast and interactive analytics for their users. The combination of these two open source projects can bring that goal to reality on petabyte-scale datasets, thanks to pre-calculated Kylin Cubes.

The Kyligence Data Science team has recently open sourced kylinpy, a project that makes this combination possible. Kylinpy is a Python-based Apache Kylin client library. Any application that uses SQLAlchemy can now query Apache Kylin with this library installed, specifically Apache Superset. Below is a brief tutorial that shows how to integrate Apache Kylin and Apache Superset.

Prerequisite

  1. Install Apache Kylin
    Please refer to this installation tutorial.
  2. Apache Kylin provides a script for you to create a sample Cube. After you successfully installed Apache Kylin, you can run the below script under Apache Kylin installation directory to generate sample project and Cube.
    ./${KYLIN_HOME}/bin/sample.sh
  3. When the script finishes running, log onto Apache Kylin web with default user ADMIN/KYLIN; in the system page click “Reload Metadata,” then you will see a sample project called “Learn Kylin.”

  4. Select the sample cube “kylin_sales_cube”, click “Actions” -> “Build”, pick a date later than 2014-01-01 (to cover all 10000 sample records);


Figure 4: Build Cube in Apache Kylin

  5. Check the build progress in the “Monitor” tab until it reaches 100%;
  6. Execute SQL in the “Insight” tab, for example:
  select part_dt,
         sum(price) as total_selled,
         count(distinct seller_id) as sellers
  from kylin_sales
  group by part_dt
  order by part_dt
-- This query will hit the newly built Cube “kylin_sales_cube”.
  7. Next, we will install Apache Superset and initialize it.
    You may refer to the Apache Superset official website instructions to install and initialize it.
  8. Install kylinpy:
   $ pip install kylinpy
  9. Verify your installation; if everything goes well, the Apache Superset daemon should be up and running.
$ superset runserver -d
Starting server with command:
gunicorn -w 2 --timeout 60 -b  0.0.0.0:8088 --limit-request-line 0 --limit-request-field_size 0 superset:app

[2018-01-03 15:54:03 +0800] [73673] [INFO] Starting gunicorn 19.7.1
[2018-01-03 15:54:03 +0800] [73673] [INFO] Listening at: http://0.0.0.0:8088 (73673)
[2018-01-03 15:54:03 +0800] [73673] [INFO] Using worker: sync
[2018-01-03 15:54:03 +0800] [73676] [INFO] Booting worker with pid: 73676
[2018-01-03 15:54:03 +0800] [73679] [INFO] Booting worker with pid: 73679

Connect Apache Kylin from ApacheSuperset

Now everything you need is installed and ready to go. Let’s try to create an Apache Kylin data source in Apache Superset.
1. Open up http://localhost:8088 in your web browser with the credential you set during Apache Superset installation.

Figure 5: Apache Superset Login Page

  2. Go to Source -> Datasource to configure a new data source.
    • The SQLAlchemy URI pattern is: kylin://<username>:<password>@<hostname>:<port>/<project_name>
    • Check “Expose in SQL Lab” if you want to expose this data source in SQL Lab.
    • Click “Test Connection” to see if the URI is working properly.
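With placeholder values filled in, a kylin:// URI can be sanity-checked with the Python standard library before entering it in Superset; the credentials, host, port, and project below are assumptions for a local sandbox:

```python
from urllib.parse import urlsplit

# hypothetical values for a local sandbox; substitute your own
uri = "kylin://ADMIN:KYLIN@localhost:7070/learn_kylin"

parts = urlsplit(uri)
assert parts.scheme == "kylin"
assert parts.username == "ADMIN" and parts.password == "KYLIN"
assert parts.hostname == "localhost" and parts.port == 7070
project = parts.path.lstrip("/")  # the Kylin project name, e.g. "learn_kylin"
```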


Figure 6: Create an Apache Kylin data source


Figure 7: Test Connection to Apache Kylin

If the connection to Apache Kylin is successful, you will see all the tables from Learn_kylin project show up at the bottom of the connection page.


Figure 8: Tables will show up if connection is successful

Query Kylin Table

  1. Go to Source -> Tables to add a new table, type in a table name from “Learn_kylin” project, for example, “Kylin_sales”.


Figure 9 Add Kylin Table in Apache Superset

  2. Click on the table you created. Now you are ready to analyze your data from Apache Kylin.


Figure 10 Query single table from Apache Kylin

Query Multiple Tables from Kylin Using SQL Lab

A Kylin Cube is usually based on a data model joining multiple tables. Thus, it is quite common to query multiple tables at the same time using Apache Kylin. In Apache Superset, you can use SQL Lab to join your data across tables by composing SQL queries. We will use a query that can hit the sample cube “kylin_sales_cube” as an example.
When you run your query in SQL Lab, the result will come from the data source, in this case, Apache Kylin.


Figure 11 Query multiple tables from Apache Kylin using SQL Lab

When the query returns results, you may immediately visualize them by clicking on the “Visualize” button.

Figure 12 Define your query and visualize it immediately

You may copy the entire SQL below to experience how you can query Kylin Cube in SQL Lab.
select YEAR_BEG_DT, MONTH_BEG_DT, WEEK_BEG_DT,
       META_CATEG_NAME, CATEG_LVL2_NAME, CATEG_LVL3_NAME,
       OPS_REGION, NAME as BUYER_COUNTRY_NAME,
       sum(PRICE) as GMV, sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL, count(*) as CNT
from KYLIN_SALES
join KYLIN_CAL_DT on CAL_DT = PART_DT
join KYLIN_CATEGORY_GROUPINGS on SITE_ID = LSTG_SITE_ID
  and KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID = KYLIN_SALES.LEAF_CATEG_ID
join KYLIN_ACCOUNT on ACCOUNT_ID = BUYER_ID
join KYLIN_COUNTRY on ACCOUNT_COUNTRY = COUNTRY
group by YEAR_BEG_DT, MONTH_BEG_DT, WEEK_BEG_DT,
         META_CATEG_NAME, CATEG_LVL2_NAME, CATEG_LVL3_NAME,
         OPS_REGION, NAME

Experience All Features in Apache Superset with Apache Kylin

Most of the common reporting features are available in Apache Superset. Now let’s see how we can use those features to analyze data from Apache Kylin.

Sorting

You may sort by a measure regardless of how it is visualized.

You may specify a “Sort By” measure or sort the measure on the visualization after the query returns.


Figure 13 Sort by

Filtering

There are multiple ways you may filter data from Apache Kylin.
1. Date Filter
You may filter date and time dimension with the calendar filter.

Figure 14 Filtering time

  2. Dimension Filter
    For other dimensions, you may filter with SQL conditions such as “in, not in, equal to, not equal to, greater than and equal to, smaller than and equal to, greater than, smaller than, like”.

    Figure 15 Filtering dimension

  3. Search Box
    In some visualizations, it is also possible to further narrow down your result set after the query returns from the data source using the “Search Box”.

    Figure 16 Search Box

  4. Filtering the measure
    Apache Superset allows you to write a “having” clause to filter the measure.

    Figure 17 Filtering measure

  5. Filter Box
    The filter box visualization allows you to create a drop-down style filter that dynamically filters all slices on a dashboard.
    As the screenshot below shows, if you filter the CATE_LVL2_NAME dimension from the filter box, all the visualizations on this dashboard will be filtered based on your selection.

    Figure 18 The filter box visualization

Top-N

To provide better query performance for Top N queries, Apache Kylin provides an approximate Top N measure to pre-calculate the top records. In Apache Superset, you may use both the “Sort By” and “Row Limit” features to make sure your query can utilize the Top N pre-calculation in the Kylin Cube.

Figure 19 use both “Sort By” and “Row Limit” to get Top 10

Page Length

Apache Kylin users often need to deal with high cardinality dimensions. When displaying a high cardinality dimension, the visualization shows too many distinct values and takes a long time to render. In that case, Apache Superset's page length feature limits the number of rows per page, so the up-front rendering effort can be reduced.

Figure 20 Limit page length

Visualizations

Apache Superset provides a rich and extensive set of visualizations, from basic charts like pie, bar, and line charts to advanced visualizations like sunburst, heatmap, world map, and Sankey diagrams.

Figure 21


Figure 22


Figure 23 World map visualization


Figure 24 bubble chart

Other functionalities

Apache Superset also supports exporting to CSV, sharing, and viewing the underlying SQL query.

Summary

With the right technical synergy of open source projects, you can achieve amazing results, more than the sum of its parts. The pre-calculation technology of Apache Kylin accelerates visualization performance. The rich functionality of Apache Superset enables all Kylin Cube features to be fully utilized. When you marry the two, you get the superpower of accelerated interactive analytics.

References

  1. Apache Kylin
  2. kylinpy on Github
  3. Superset: Airbnb's data exploration platform
  4. Apache Superset on Github