
Apache Kylin v1.5.2 Release Announcement


The Apache Kylin community is pleased to announce the release of Apache Kylin v1.5.2.

Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets; it was originally contributed by eBay Inc.

To download Apache Kylin v1.5.2 source code or binary package:
please visit the download page.

This is a major release that brings a more stable, robust, and manageable version; the Apache Kylin community resolved about 76 issues, including bug fixes, improvements, and a few new features.

Change Highlights

New Feature

Improvement

  • Enhance mail notification KYLIN-869
  • HiveColumnCardinalityJob should use configurations in conf/kylin_job_conf.xml KYLIN-955
  • Enable deriving dimensions on non PK/FK KYLIN-1313
  • Improve performance of converting data to hfile KYLIN-1323
  • Tools to extract all cube/hybrid/project related metadata to facilitate diagnosing/debugging/sharing KYLIN-1340
  • change RealizationCapacity from three profiles to specific numbers KYLIN-1381
  • quicker and better response to v2 storage engine’s rpc timeout exception KYLIN-1391
  • Memory hungry cube should select LAYER and INMEM cubing smartly KYLIN-1418
  • For GUI, to add one option “yyyy-MM-dd HH:MM:ss” for Partition Date Column KYLIN-1432
  • cuboid sharding based on specific column KYLIN-1453
  • attach a hyperlink to introduce new aggregation group KYLIN-1487
  • Move query cache back to query controller level KYLIN-1526
  • Hfile owner is not hbase KYLIN-1542
  • Make hbase encoding and block size configurable just like hbase compression KYLIN-1544
  • Refactor storage engine(v2) to be extension friendly KYLIN-1561
  • Add and use a separate kylin_job_conf.xml for in-mem cubing KYLIN-1566
  • Front-end work for KYLIN-1557 KYLIN-1567
  • Coprocessor thread voluntarily stop itself when it reaches timeout KYLIN-1578
  • IT preparation classes like BuildCubeWithEngine should exit with status code upon build exception KYLIN-1579
  • Use 1 byte instead of 8 bytes as column indicator in fact distinct MR job KYLIN-1580
  • Specify region cut size in cubedesc and leave the RealizationCapacity in model as a hint KYLIN-1584
  • make MAX_HBASE_FUZZY_KEYS in GTScanRangePlanner configurable KYLIN-1585
  • show cube level configuration overwrites properties in CubeDesigner KYLIN-1587
  • enabling different block size setting for small column families KYLIN-1591
  • Add “isShardBy” flag in rowkey panel KYLIN-1599
  • Need not to shrink scan cache when hbase rows can be large KYLIN-1601
  • User could dump hbase usage for diagnosis KYLIN-1602
  • Bring more information in diagnosis tool KYLIN-1614
  • Use deflate level 1 to enable compression “on the fly” KYLIN-1621
  • Make the hll precision for data samping configurable KYLIN-1623
  • HyperLogLogPlusCounter will become inaccurate when there’re billions of entries KYLIN-1624
  • GC log overwrites old one after restart Kylin service KYLIN-1625
  • add backdoor toggle to dump binary cube storage response for further analysis KYLIN-1627

Bug

  • column width is too narrow for timestamp field KYLIN-989
  • cube data not updated after purge KYLIN-1197
  • Can not get more than one system admin email in config KYLIN-1305
  • Should check and ensure TopN measure has two parameters specified KYLIN-1551
  • Unsafe check of initiated in HybridInstance#init() KYLIN-1563
  • Select any column when adding a custom aggregation in GUI KYLIN-1569
  • Unclosed ResultSet in QueryService#getMetadata() KYLIN-1574
  • NPE in Job engine when execute MR job KYLIN-1581
  • Agg group info will be blank when trying to edit cube KYLIN-1593
  • columns in metric could also be in filter/groupby KYLIN-1595
  • UT fail, due to String encoding CharsetEncoder mismatch KYLIN-1596
  • cannot run complete UT at windows dev machine KYLIN-1598
  • Concurrent write issue on hdfs when deploy coprocessor KYLIN-1604
  • Cube is ready but insight tables not result KYLIN-1612
  • UT ‘HiveCmdBuilderTest’ fail on ‘testBeeline’ KYLIN-1615
  • Can’t find any realization coursed by Top-N measure KYLIN-1619
  • sql not executed and report topN error KYLIN-1622
  • Web UI of TopN, “group by” column couldn’t be a dimension column KYLIN-1631
  • Unclosed OutputStream in SSHClient#scpFileToLocal() KYLIN-1634
  • Sample cube build error KYLIN-1637
  • Unclosed HBaseAdmin in ToolUtil#getHBaseMetaStoreId() KYLIN-1638
  • Wrong logging of JobID in MapReduceExecutable.java KYLIN-1639
  • Kylin’s hll counter count “NULL” as a value KYLIN-1643
  • Purge a cube, and then build again, the start date is not updated KYLIN-1647
  • java.io.IOException: Filesystem closed - in Cube Build Step 2 (MapR) KYLIN-1650
  • function name ‘getKylinPropertiesAsInputSteam’ misspelt KYLIN-1655
  • Streaming/kafka config not match with table name KYLIN-1660
  • tableName got truncated during request mapping for /tables/tableName KYLIN-1662
  • Should check project selection before add a stream table KYLIN-1666
  • Streaming table name should allow enter “DB.TABLE” format KYLIN-1667
  • make sure metadata in 1.5.2 compatible with 1.5.1 KYLIN-1673
  • MetaData clean just clean FINISHED and DISCARD jobs,but job correct status is SUCCEED KYLIN-1678
  • error happens while execute a sql contains ‘?’ using Statement KYLIN-1685
  • Illegal char on result dataset table KYLIN-1688
  • KylinConfigExt lost base properties when store into file KYLIN-1721
  • IntegerDimEnc serialization exception inside coprocessor KYLIN-1722

Upgrade

Data and metadata of this version are backward compatible with v1.5.1, but you may need to redeploy the HBase coprocessor.

Support

For any issue or question, please
open a JIRA ticket for the Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to the Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who contributed!



RAW measure in Apache Kylin


Introduction

The RAW measure function is used to query the detail data of a measure column in Kylin.

Example data:

| DT | SITE_ID | SELLER_ID | ITEM_COUNT |
| :--------: | :-----: | :--------: | :--------: |
| 2016-05-01 | 0 | SELLER-001 | 100 |
| 2016-05-01 | 0 | SELLER-002 | 200 |
| 2016-05-02 | 1 | SELLER-003 | 300 |
| 2016-05-02 | 1 | SELLER-004 | 400 |
| 2016-05-03 | 2 | SELLER-005 | 500 |

Suppose we design the cube with the DT and SITE_ID columns as dimensions and SUM(ITEM_COUNT) as the measure. The base cuboid data will then look like this:

| Rowkey of base cuboid | SUM(ITEM_COUNT) |
| :-------------------: | :-------------: |
| 2016-05-01_0 | 300 |
| 2016-05-02_1 | 700 |
| 2016-05-03_2 | 500 |

For the first row of the base cuboid data, Kylin can extract the dimension values 2016-05-01, 0 from the HBase rowkey, but the measure cell stores only the aggregated result 300; the raw values 100 and 200 of the ITEM_COUNT column before aggregation cannot be recovered.

The RAW function makes the SQL:

SELECT DT,SITE_ID,ITEM_COUNT FROM FACT_TABLE

to return the correct result:

| DT | SITE_ID | ITEM_COUNT |
| :--------: | :-----: | :--------: |
| 2016-05-01 | 0 | 100 |
| 2016-05-01 | 0 | 200 |
| 2016-05-02 | 1 | 300 |
| 2016-05-02 | 1 | 400 |
| 2016-05-03 | 2 | 500 |

How to use

  • Use Kylin version 1.5.1 or later.
  • As in the case above, define DT and SITE_ID as dimensions, and RAW(ITEM_COUNT) as a measure.
  • After the cube is built, you can query the raw data with SQL:
SELECT DT,SITE_ID,ITEM_COUNT FROM FACT_TABLE WHERE SITE_ID = 0

Optimize

The column on which a RAW measure is defined is encoded with a dictionary by default, so you should know your data's cardinality and distribution characteristics.

  • Where possible, choose uniformly distributed columns as dimensions; this keeps the measure cell sizes more uniform and avoids data skew.
  • If you define a RAW measure on an ultra-high-cardinality column, try the following to avoid dictionary build errors:
    1. Split a big segment into several smaller segments, if you were trying to build a large data set at once;
    2. Set kylin.dictionary.max.cardinality in conf/kylin.properties to a bigger value (default is 5000000).
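For instance, raising the dictionary limit in conf/kylin.properties might look like this (10000000 is an arbitrary example value):

```
# default is 5000000; raise only if the RAW column truly needs it
kylin.dictionary.max.cardinality=10000000
```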

To be improved

  • Currently, at most 1M values of a RAW measure can be stored in one cuboid; exceeding this throws a BufferOverflowException during cube build. This will be optimized in a later release.
  • Only dimension columns can be used in the WHERE condition; filtering on a RAW measure column is not supported.

Implement

  • A custom aggregation function RAW is implemented; its return type depends on the column type.
  • The RAW aggregation function saves the column's raw data in the base cuboid data.
  • The HBase value cell stores the dictionary ids of the raw data, to save space.
  • SQL that contains a RAW measure column is routed to the base cuboid.
  • At query time, the raw data is extracted from the base cuboid data and assembled with the dimension values into complete rows.

Deploy Apache Kylin with Standalone HBase Cluster

$
0
0

Introduction

Apache Kylin mainly uses HBase to store cube data, so the performance of the HBase cluster directly impacts Kylin's query performance. Commonly, HBase is deployed alongside MR/Hive on one HDFS cluster, which limits the resources available to HBase and lets MR jobs affect HBase performance. These problems can be resolved with a standalone HBase cluster, and Apache Kylin now supports this deployment mode.

Environment Requirements

To enable standalone HBase cluster support, check the basic environment first:

  • Deploy the main cluster and the HBase cluster, and make sure both work normally
  • Make sure the Kylin server can access both clusters using the hdfs shell with fully qualified paths
  • Make sure the Kylin server can submit MR jobs to the main cluster and can use the hive shell to access the data warehouse; the hadoop and hive configurations must point to the main cluster
  • Make sure the Kylin server can access the HBase cluster using the hbase shell; the hbase configuration must point to the HBase cluster
  • Make sure jobs on the main cluster can access the HBase cluster directly

Configurations

Set the config kylin.hbase.cluster.fs in kylin.properties to the Namenode address of the HBase cluster, like hdfs://hbase-cluster-nn01.example.com:8020

Note that the value must stay consistent with the Namenode address in root.dir on the HBase Master node, to ensure bulk load into HBase works.
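In kylin.properties this looks like the following (the hostname is a placeholder):

```
kylin.hbase.cluster.fs=hdfs://hbase-cluster-nn01.example.com:8020
```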

Using NN HA

HDFS Namenode HA improves cluster availability significantly, and the HBase cluster may have it enabled. Apache Kylin doesn't support this HA setup perfectly yet; here is the workaround:

  • Add all dfs.nameservices related configs of the HBase cluster into hadoop/etc/hadoop/hdfs-site.xml on the Kylin server, so that the HBase cluster can be accessed via the hdfs shell with a nameservice path
  • Add all dfs.nameservices related configs of both clusters into kylin_job_conf.xml, so that MR jobs can access the HBase cluster with a nameservice path
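As a sketch, the HBase cluster's HA entries in hdfs-site.xml typically look like the following (the `hbasecluster` nameservice ID and hostnames are placeholders):

```xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>hbasecluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.hbasecluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbasecluster.nn1</name>
    <value>hbase-nn01.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hbasecluster.nn2</name>
    <value>hbase-nn02.example.com:8020</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.hbasecluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```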

Troubleshooting

  • UnknownHostException occurs during cube building
    This usually occurs with an HBase HA nameservice config; please refer to the section “Using NN HA” above
  • HFile bulk loading gets stuck for a long time
    Check the regionserver log; there should be many errors with a WrongFS exception. Make sure the Namenode address in kylin.hbase.cluster.fs in kylin.properties and in root.dir in the HBase master's hbase-site.xml are the same

Diagnosis Tool Introduction


Introduction

Since Apache Kylin 1.5.2, there is a diagnosis tool on the Web UI, which aims to help Kylin admins extract diagnostic information for fault analysis and performance tuning.

Project Diagnosis

When users hit issues with query failures, slow queries, metadata management, and so on, they can click the ‘Diagnosis’ button on the System tab page.

A few seconds later, a diagnosis package becomes available to download from the web browser, containing Kylin logs, metadata, configuration, etc. Users can extract and analyze the package; also, when asking experts on the team for help, attaching the package improves communication efficiency.

Job Diagnosis

When users hit issues with jobs, such as failed or slow cubing, they can click the ‘Diagnosis’ button in the job's Action menu.

As with project diagnosis, a diagnosis package containing related logs, MR job info, metadata, etc. will be downloaded from the web browser. Users can use it for analysis or to ask for help.

Apache Kylin v1.5.3 Release Announcement


The Apache Kylin community is pleased to announce the release of Apache Kylin v1.5.3.

Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets; it was originally contributed by eBay Inc.

To download Apache Kylin v1.5.3 source code or binary package:
please visit the download page.

This is a major release that brings a more stable, robust, and manageable version; the Apache Kylin community resolved about 84 issues, including bug fixes, improvements, and a few new features.

Change Highlights

  • A better way to check hadoop job status KYLIN-1319
  • Global (and more scalable) dictionary KYLIN-1705
  • More stable and functional precise count distinct implements after KYLIN-1186 KYLIN-1379
  • Improve performance of MRv2 engine by making each mapper handles a configured number of records KYLIN-1656
  • Distribute source data by certain columns when creating flat table KYLIN-1677
  • Allow cube to override MR job configuration by properties KYLIN-1706
  • Allow non-admin users to edit the ‘Advanced Setting’ step in CubeDesigner KYLIN-1731
  • Calculate all 0 (except mandatory) cuboids KYLIN-1747
  • Allow mandatory only cuboid KYLIN-1749
  • Couldn’t use View as Lookup when join type is “inner” KYLIN-1789
  • Exception inside coprocessor should report back to the query thread KYLIN-1645
  • minimize dependencies of JDBC driver KYLIN-1846
  • TopN measure support non-dictionary encoding KYLIN-1478

Upgrade

Follow the upgrade guide.

Support

For any issue or question, please
open a JIRA ticket for the Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to the Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who contributed!


Use Count Distinct in Apache Kylin


Since v.1.5.3

Background

Count distinct is a common measure in OLAP analysis, often used for UV (unique visitors) and similar metrics. Apache Kylin offers two kinds of count distinct, approximate and precise, which differ in resource usage and performance.

Approximate Count Distinct

Apache Kylin implements approximate count distinct using the HyperLogLog algorithm, offering several precisions with error rates from 9.75% down to 1.22%.
The measure result has a theoretical upper size limit of 2^N bytes; for the maximum precision N=16, the upper limit is 64KB and the maximum error rate is 1.22%.
The pros of this implementation are fast calculation and storage savings, but it cannot be used when exact results are required.
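The quoted figures follow from the HyperLogLog error bound. A small sketch (assuming the ~3-sigma bound 3.12/sqrt(m) over m = 2^N registers, which reproduces the rates above):

```python
import math

def hll_stats(precision):
    """For HLL precision N there are m = 2**N registers, so the result
    size is bounded by m bytes and the ~3-sigma error bound is 3.12/sqrt(m)."""
    m = 2 ** precision
    return m, 3.12 / math.sqrt(m)

size, err = hll_stats(16)
print(size, round(err * 100, 2))          # 65536 bytes (64KB), 1.22% error
print(round(hll_stats(10)[1] * 100, 2))   # 9.75% error
```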

Precise Count Distinct

Apache Kylin also implements precise count distinct based on bitmaps. For data of type tinyint (byte), smallint (short), and int, the value is projected into the bitmap directly. For data of type long, string, and others, the value is encoded as a string into a dictionary, and the dictionary id is projected into the bitmap.
The measure result is the serialized bitmap, not just the count value. This makes sure the result is always right under any roll-up, even across segments.
The pros of this implementation are precise, error-free results, but it needs more storage; one result can be hundreds of MB when the count distinct value exceeds millions.
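The mechanism can be sketched in a few lines (plain Python sets stand in for the actual bitmaps; the names are illustrative):

```python
# Each segment stores a bitmap of encoded ids rather than a count,
# so roll-up is a bitmap union and the final cardinality stays exact.

def encode(dictionary, value):
    """Assign a stable id to each distinct value (dictionary encoding)."""
    if value not in dictionary:
        dictionary[value] = len(dictionary) + 1
    return dictionary[value]

dictionary = {}
seg1 = {encode(dictionary, v) for v in ["AAA", "BBB"]}  # segment 1 bitmap
seg2 = {encode(dictionary, v) for v in ["AAA", "CCC"]}  # segment 2 bitmap

# Aggregating across segments: union the bitmaps, then count.
print(len(seg1 | seg2))  # 3 distinct values, exact
```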

Global Dictionary

Apache Kylin encodes values into a dictionary at the segment level by default. That means the same value in different segments may be encoded to different ids, so the result of precise count distinct across segments may be incorrect.
To resolve this, we introduced the Global Dictionary, which guarantees that the same value is always encoded to the same id across segments. Meanwhile, dictionary capacity has expanded dramatically, supporting up to 2G values in one dict; it can also replace the default dictionary, which has a 5M value limitation.
The current version has no UI for the global dictionary yet, so the cube desc JSON should be modified to enable it:

"dictionaries": [
    {
        "column": "SUCPAY_USERID",
        "reuse": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]

`column` is the column to be encoded; `builder` specifies the dictionary builder, and only org.apache.kylin.dict.GlobalDictionaryBuilder is available for now.
`reuse` optimizes the dictionaries of multiple columns that draw from one dataset; please refer to the next section, ‘Example’, for details.
The global dictionary can't be used for dimension encoding for now: if a column serves as both a dimension and a count distinct measure in one cube, the dimension must use an encoding other than dict.

Example

Here’s some example data:
| DT | USER_ID | FLAG1 | FLAG2 | USER_ID_FLAG1 | USER_ID_FLAG2 |
| :--------: | :-----: | :---: | :---: | :-----------: | :-----------: |
| 2016-06-08 | AAA | 1 | 1 | AAA | AAA |
| 2016-06-08 | BBB | 1 | 1 | BBB | BBB |
| 2016-06-08 | CCC | 0 | 1 | NULL | CCC |
| 2016-06-09 | AAA | 0 | 1 | NULL | AAA |
| 2016-06-09 | CCC | 1 | 0 | CCC | NULL |
| 2016-06-10 | BBB | 0 | 1 | NULL | BBB |

There are basic columns DT, USER_ID, FLAG1, FLAG2, and condition columns USER_ID_FLAG1=if(FLAG1=1,USER_ID,null), USER_ID_FLAG2=if(FLAG2=1,USER_ID,null). Suppose the cube is built by day and has 3 segments.

Without the global dictionary, precise count distinct within a segment is correct, but the rolled-up result across segments is wrong. Here's an example:

select count(distinct user_id_flag1) from table where dt in ('2016-06-08', '2016-06-09')

The result is 2 rather than 3. The reason is that the dict in the 2016-06-08 segment is AAA=>1, BBB=>2, while the dict in the 2016-06-09 segment is CCC=>1.
With the global dictionary config below, the dict becomes AAA=>1, BBB=>2, CCC=>3, which produces the correct result:

"dictionaries": [
    {
        "column": "USER_ID_FLAG1",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]

Actually, the data of USER_ID_FLAG1 and USER_ID_FLAG2 are both subsets of the USER_ID dataset, which makes dictionary re-use possible: encode the USER_ID dataset once, and configure USER_ID_FLAG1 and USER_ID_FLAG2 to reuse the USER_ID dict:

"dictionaries": [
    {
        "column": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    },
    {
        "column": "USER_ID_FLAG1",
        "reuse": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    },
    {
        "column": "USER_ID_FLAG2",
        "reuse": "USER_ID",
        "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]

Performance Tuning

When using a global dictionary, and the dictionary is large, the step ‘Build Base Cuboid Data’ may take a long time. That is mainly caused by dictionary cache loading and eviction, since the dictionary is bigger than the mapper's memory. To solve this, override the cube configuration as follows to raise the mapper size to 8GB:

kylin.job.mr.config.override.mapred.map.child.java.opts=-Xmx8g
kylin.job.mr.config.override.mapreduce.map.memory.mb=8500

Conclusions

Here are some basic principles for deciding which kind of count distinct to use:
- If a result with some error is acceptable, the approximate way is almost always the better choice
- If you need a precise result, the only way is precise count distinct
- If you don't need roll-up across segments, or the column type is tinyint/smallint/int, or the distinct value count is below 5M, just use the default dictionary; otherwise configure the global dictionary, and consider the reuse-column optimization


Query Metrics in Apache Kylin


Apache Kylin supports query metrics since 1.5.4. This blog introduces why Kylin needs query metrics, their concrete contents and meaning, their daily uses, and how to collect them.

Background

When Kylin becomes an enterprise application, you must ensure the query service is highly available and performant; besides, you need to commit to an SLA for the query service, which requires Kylin to support query metrics.

Introduction

Query metrics are exposed at three levels: Server, Project, and Cube.

For example, QueryCount has three kinds of metrics:

```
Hadoop:name=Server_Total,service=Kylin.QueryCount
Hadoop:name=learn_kylin,service=Kylin.QueryCount
Hadoop:name=learn_kylin,service=Kylin,sub=kylin_sales_cube.QueryCount
```

Server_Total represents a query server node, learn_kylin is a project name, and kylin_sales_cube is a cube name.
The Key Query Metrics

  • QueryCount: the total number of queries.
  • QueryFailCount: the total number of failed queries.
  • QuerySuccessCount: the total number of successful queries.
  • CacheHitCount: the total number of query cache hits.
  • QueryLatency60s99thPercentile: the 99th percentile of query latency over a 60s window. (There are five percentiles: 99th, 95th, 90th, 75th, 50th, and three time intervals: 60s, 360s, 3600s. The intervals can be set by kylin.query.metrics.percentiles.intervals, whose default value is 60, 360, 3600.)
  • QueryLatencyAvgTime, QueryLatencyMaxTime, QueryLatencyMinTime: the average, max, and min query latency.
  • ScanRowCount: the number of rows scanned from HBase; broken down like QueryLatency.
  • ResultRowCount: the number of rows in query results; broken down like QueryLatency.

Daily Function

Besides backing a query-service SLA for users, in daily operation and maintenance you can build a Kylin query daily report and a query dashboard from these metrics, which helps you understand query patterns and performance and analyze query incidents.

How To Use

First, set the config kylin.query.metrics.enabled to true to collect query metrics to JMX.

Second, you can use any JMX collection tool to ship the query metrics to your monitoring system. Note that the metrics exist at the Server, Project, and Cube levels, implemented via dynamic ObjectNames, so you should match ObjectNames with a regular expression.
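Since the ObjectNames are dynamic, one way to pick out the QueryCount beans per server, project, or cube is a regular expression over the bean names (a sketch; the patterns follow the examples above):

```python
import re

# Matches e.g. "Hadoop:name=learn_kylin,service=Kylin,sub=kylin_sales_cube.QueryCount"
PATTERN = re.compile(
    r"Hadoop:name=(?P<name>[^,]+),service=Kylin(?:,sub=(?P<cube>[^.]+))?\.QueryCount"
)

beans = [
    "Hadoop:name=Server_Total,service=Kylin.QueryCount",
    "Hadoop:name=learn_kylin,service=Kylin.QueryCount",
    "Hadoop:name=learn_kylin,service=Kylin,sub=kylin_sales_cube.QueryCount",
]
for bean in beans:
    m = PATTERN.match(bean)
    print(m.group("name"), m.group("cube"))  # cube is None for server/project beans
```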

New NRT Streaming in Apache Kylin


In 1.5.0 Apache Kylin introduced the Streaming Cubing feature, which can consume data from a Kafka topic directly. This blog introduces how it was implemented, and this tutorial introduces how to use it.

However, that implementation was marked as “experimental” because it has the following limitations:

  • Not scalable: it starts a Java process for a micro-batch cube build instead of leveraging a computing framework; if too many messages arrive at once, the build may fail with an OutOfMemory error;

  • May lose data: it uses a time window to seek the approximate start/end offsets on the Kafka topic, so messages arriving too late or too early are skipped, and queries can't guarantee 100% accuracy.

  • Difficult to monitor: the streaming cubing is outside the job engine's scope, so users cannot monitor the jobs via the Web GUI or REST API.

  • Others: hard to recover from failures, difficult to maintain the code, etc.

To overcome these limitations, the Apache Kylin team developed a new streaming design (KYLIN-1726) on Kafka 0.10; it has been tested internally for some time and will be released to the public soon.

The new design is a natural fit for Kylin 1.5's “plug-in” architecture: treat the Kafka topic as a “Data Source” like a Hive table, using an adapter to extract the data to HDFS; the following steps are almost the same as for other cubes. Figure 1 shows a high-level architecture of the new design.

Kylin New Streaming Framework Architecture

The adapter that reads Kafka messages is modified from kafka-hadoop-loader, which its author Michal Harish open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS. Kylin can thus leverage an existing framework like MR for the processing, which makes the solution scalable and fault-tolerant.

To overcome the “data loss” limitation, Kylin records the start/end offsets on each cube segment and uses the offsets as the partition value (no overlap allowed); this ensures no data is lost and each message is consumed at most once. So that late/early messages can still be queried, cube segments are allowed to overlap on the partition time dimension: each segment has a “min” date/time and a “max” date/time, and Kylin scans all segments that intersect the queried time scope. Figure 2 illustrates this.

Use Offset to Cut Segments
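The query-side matching can be sketched like this (illustrative names and values; each segment carries exact, non-overlapping offsets plus a min/max event time that may overlap with its neighbors):

```python
# (start_offset, end_offset, min_time, max_time) per segment;
# offsets never overlap, but event-time ranges may.
segments = [
    (0,    1000, "2016-10-14 09:58", "2016-10-14 10:05"),
    (1000, 2300, "2016-10-14 10:03", "2016-10-14 10:12"),
    (2300, 3100, "2016-10-14 10:11", "2016-10-14 10:20"),
]

def segments_for(query_start, query_end):
    """All segments whose event-time range intersects the queried scope."""
    return [s for s in segments
            if s[2] <= query_end and s[3] >= query_start]

hits = segments_for("2016-10-14 10:04", "2016-10-14 10:10")
print(len(hits))  # the first two segments overlap this scope
```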

Other changes/enhancements are made in the new streaming:

  • Allow multiple segments being built/merged concurrently
  • Automatically seek start/end offsets (if user doesn’t specify) from previous segment or Kafka
  • Support embedded properties in JSON messages
  • Add REST API to trigger streaming cube’s building
  • Add REST API to check and fill the segment holes

The integration test result is promising:

  • Scalability: it can easily process up to hundreds of millions of records in one build;
  • Flexibility: you can trigger the build at any time, with whatever frequency you want; for example, every 5 minutes in the daytime but every hour at night, or even pause when you need to do maintenance; Kylin manages the offsets so it can automatically continue from the last position;
  • Stability: pretty stable, no OutOfMemoryError;
  • Management: users can check all jobs' status through Kylin's “Monitor” page or REST API;
  • Build performance: in a test cluster (8 AWS instances consuming Twitter streams at about 10 thousand messages per second, with a 9-dimension cube with 3 measures), a 2-minute build interval finishes in around 3 minutes, and a 5-minute interval in around 4 minutes;

Here are a couple of screenshots from this test; we may compose them into a step-by-step tutorial in the future:
Streaming Job Monitoring

Streaming Adapter

Streaming Twitter Sample

In short, this is a more robust Near Real Time streaming OLAP solution compared with the previous version. Next, the Apache Kylin team will move toward a real-time engine.

Use Window Function and Grouping Sets in Apache Kylin


Since v.1.5.4

Background

Apache Kylin provides window functions and grouping sets to support more complicated queries while keeping SQL statements simple and clear. This article shows how to use them.

Window Function

Window functions give us the ability to partition, order, and aggregate over the query result set, letting us meet complicated query requirements with much simpler SQL statements.
The window function syntax of Apache Kylin can be found in the calcite reference and is similar to Hive window functions.
Here’s some examples:

sum(col) over()
count(col) over(partition by col2)
row_number() over(partition by col order by col2)
first_value(col) over(partition by col2 order by col3 rows 2 preceding)
lag(col, 5, 0) over(partition by col2 order by col3 range between 3 preceding and 6 following)

Grouping Sets

Sometimes we want to aggregate data by different keys in one SQL statement. This is possible with the grouping sets feature.
Here's an example; suppose we execute this query and get the result below:

select dim1, dim2, sum(col) as metric1 from table group by dim1, dim2

| dim1 | dim2 | metric1 |
| :--: | :--: | :-----: |
| A | AA | 10 |
| A | BB | 20 |
| B | AA | 15 |
| B | BB | 30 |

If we also want to see the result with dim2 rolled up, we can rewrite the SQL and get the result below:

select dim1,
case grouping(dim2) when 1 then 'ALL' else dim2 end,
sum(col) as metric1
from table
group by grouping sets((dim1, dim2), (dim1))
dim1 | dim2 | metric1
-----|------|--------
A    | ALL  | 30
A    | AA   | 10
A    | BB   | 20
B    | ALL  | 45
B    | AA   | 15
B    | BB   | 30

Apache Kylin supports cube/rollup/grouping sets; the grouping functions can be found here. This is also similar to Hive's grouping sets.
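The rolled-up totals above can be double-checked with a plain Python sketch of the grouping-sets aggregation (same sample data; illustrative only, not how Kylin executes it):

```python
from collections import defaultdict

rows = [("A", "AA", 10), ("A", "BB", 20), ("B", "AA", 15), ("B", "BB", 30)]

def grouping_sets(rows, key_sets):
    """Aggregate sum(metric) once per grouping set, like GROUP BY GROUPING SETS."""
    result = []
    for keys in key_sets:
        acc = defaultdict(int)
        for dim1, dim2, metric in rows:
            row = {"dim1": dim1, "dim2": dim2}
            acc[tuple(row[k] for k in keys)] += metric
        for group, total in acc.items():
            result.append((keys, group, total))
    return result

# group by grouping sets ((dim1, dim2), (dim1))
out = grouping_sets(rows, [("dim1", "dim2"), ("dim1",)])
# The (dim1,) set yields the rolled-up rows: A -> 30, B -> 45.
```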

Retention and Conversion Rate Analysis in Apache Kylin


Since v1.6.0

Background

Retention and conversion rates are important in data analysis. In general, the value can be calculated from the intersection of two data sets (of uuids, etc.) that share some dimensions (city, category, etc.) and differ in one varying dimension (date, etc.).
Apache Kylin supports retention calculation based on Bitmap and the UDAF intersect_count. This article introduces how to use this feature.

Usage

To use retention calculation in Apache Kylin, the following requirements must be met:
* Only one dimension can be the varying dimension
* The column to be calculated must have a precise count distinct measure defined

The intersect_count usage is described below:

intersect_count(columnToCount, columnToFilter, filterValueList)
`columnToCount` the column to calculate the distinct count on
`columnToFilter` the varying dimension
`filterValueList` the values of the varying dimension, given as an array

Here are some examples:

intersect_count(uuid, dt, array['20161014', '20161015'])
The precise distinct count of uuids that show up in both 20161014 and 20161015

intersect_count(uuid, dt, array['20161014', '20161015', '20161016'])
The precise distinct count of uuids that show up in all of 20161014, 20161015 and 20161016

intersect_count(uuid, dt, array['20161014'])
The precise distinct count of uuids that show up in 20161014, equivalent to `count(distinct uuid)`

A complete SQL statement example:

select city, version,
intersect_count(uuid, dt, array['20161014']) as first_day,
intersect_count(uuid, dt, array['20161015']) as second_day,
intersect_count(uuid, dt, array['20161016']) as third_day,
intersect_count(uuid, dt, array['20161014', '20161015']) as retention_oneday,
intersect_count(uuid, dt, array['20161014', '20161015', '20161016']) as retention_twoday
from visit_log
where dt in ('20161014', '20161015', '20161016')
group by city, version
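Conceptually, intersect_count over Bitmaps is a set intersection. The retention numbers from a query like the one above can be modeled in Python with toy data (the uuids below are made up for illustration):

```python
# uuids seen per day (toy data; in Kylin these are precise-count-distinct bitmaps)
visits = {
    "20161014": {"u1", "u2", "u3"},
    "20161015": {"u2", "u3", "u4"},
    "20161016": {"u3", "u5"},
}

def intersect_count(column_by_filter, filter_values):
    """Mimic intersect_count(uuid, dt, array[...]): size of the intersection."""
    sets = [column_by_filter[v] for v in filter_values]
    return len(set.intersection(*sets))

first_day = intersect_count(visits, ["20161014"])                          # count(distinct uuid) on day 1
retention_oneday = intersect_count(visits, ["20161014", "20161015"])       # u2, u3 retained
retention_twoday = intersect_count(visits, ["20161014", "20161015", "20161016"])  # only u3 retained
```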

Conclusions

Based on Bitmap and the UDAF intersect_count, we can do fast and convenient retention analysis in Apache Kylin. Compared with the traditional way, the SQL in Apache Kylin is much simpler, clearer, and more efficient.

Apache Kylin v1.6.0 Release Announcement


The Apache Kylin community is pleased to announce the release of Apache Kylin v1.6.0.

Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.

This is a major release after 1.5.4, with reliable and scalable support for using Apache Kafka as a data source; this enables users to build cubes directly from streaming data (without loading it into Apache Hive), reducing data latency from days/hours to minutes.

Apache Kylin 1.6.0 resolved 102 issues including bug fixes, improvements, and new features. All of the changes can be found in the release notes.

Change Highlights

  • Scalable streaming cubing KYLIN-1726
  • TopN counter merge performance improvement KYLIN-1917
  • Support Embedded Structure JSON Message KYLIN-1919
  • More robust approach to hive schema changes KYLIN-2012
  • TimedJsonStreamParser should support other time format KYLIN-2054
  • Add an encoder for Boolean type KYLIN-2055
  • Allow concurrent build/refresh/merge KYLIN-2070
  • Support to change streaming configuration KYLIN-2082

To download Apache Kylin v1.6.0 source code or binary package, visit the download page.

Upgrade

Follow the upgrade guide.

Support

For any issue or question, please
open a JIRA in the Apache Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who contributed!

Apache Kylin v1.6.0 Officially Released


The Apache Kylin community is pleased to announce the official release of Apache Kylin v1.6.0.

Apache Kylin is an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on top of Hadoop, supporting second-level queries over extremely large datasets.

Apache Kylin v1.6.0 brings a more reliable and easier-to-manage capability to build Cubes directly from Apache Kafka streams, letting users analyze data more naturally in more scenarios and reducing the latency from data generation to data being queryable from a day or several hours down to minutes. Apache Kylin 1.6.0 resolved 102 issues including bug fixes, improvements, and new features; see the release notes for details.

Change Highlights

To download the Apache Kylin v1.6.0 source code or binary package, please visit the download page.

Upgrade

See the upgrade guide.

Support

For any issue or question during upgrade or use, please
open a JIRA in the Apache Kylin project: https://issues.apache.org/jira/browse/KYLIN/
or
send mail to the Apache Kylin dev mailing list: dev@kylin.apache.org

Great thanks to everyone who participated and contributed!

By-layer Spark Cubing

$
0
0

Before v2.0, Apache Kylin used Hadoop MapReduce as the framework to build Cubes over huge datasets. The MapReduce framework is simple and stable and fulfills Kylin's needs very well, except for performance. To get better performance, we introduced the "fast cubing" algorithm in Kylin v1.5, which tries to do as many aggregations as possible in memory at the map side, so as to avoid disk and network I/O; but not all data models can benefit from it, and it still runs on MR, which means on-disk sorting and shuffling.

Now Spark comes along; Apache Spark is an open-source cluster-computing framework which provides programmers with an application programming interface centered on a data structure called the RDD. It runs in memory on the cluster, which makes repeated access to the same data much faster. Spark provides flexible and fancy APIs, and you are not tied to Hadoop's MapReduce two-stage paradigm.

Before introducing how to calculate a Cube with Spark, let's see how Kylin does it with MR. Figure 1 illustrates how a 4-dimension Cube gets calculated with the classic "by-layer" algorithm: the first round of MR aggregates the base (4-D) cuboid from source data; the second round aggregates on the base cuboid to get the 3-D cuboids; with N+1 rounds of MR, all layers' cuboids get calculated.

MapReduce Cubing by Layer

The "by-layer" cubing divides a big task into several steps, and each step builds on the previous step's output, so it can reuse the previous calculation and avoid recalculating from the very beginning when there is a failure in between. This makes it a reliable algorithm. When moving to Spark, we decided to keep this algorithm; that's why we call this feature "By-layer Spark Cubing".

As we know, the RDD (Resilient Distributed Dataset) is a basic concept in Spark. A collection of N-dimension cuboids can be well described as an RDD, so an N-dimension Cube will have N+1 RDDs. These RDDs have a parent/child relationship, as the parent can be used to generate the children. With the parent RDD cached in memory, generating the child RDD can be much more efficient than reading from disk. Figure 2 describes this process.
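Ignoring RDD mechanics and distribution, the by-layer step of aggregating each child cuboid from its parent can be sketched in plain Python (the dimension names and values here are illustrative):

```python
from collections import defaultdict

def aggregate_layer(parent, child_dims_list):
    """Build each child cuboid by summing the parent's measure
    over the dropped dimensions (one by-layer step)."""
    children = {}
    for dims in child_dims_list:
        acc = defaultdict(int)
        for key, measure in parent["rows"].items():
            # keep only the values of the surviving dimensions
            kept = tuple(v for d, v in zip(parent["dims"], key) if d in dims)
            acc[kept] += measure
        children[dims] = {"dims": dims, "rows": dict(acc)}
    return children

# base (2-D) cuboid: (country, city) -> sum(sales)
base = {"dims": ("country", "city"),
        "rows": {("CN", "SH"): 10, ("CN", "BJ"): 5, ("US", "NY"): 7}}

# one by-layer step: the 2-D cuboid generates both 1-D cuboids
layer1 = aggregate_layer(base, [("country",), ("city",)])
```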

Spark Cubing by Layer

Figure 3 is the DAG of cubing in Spark; it illustrates the process in detail: in "Stage 5", Kylin uses a HiveContext to read the intermediate Hive table and then does a "map" operation, a one-to-one map, to encode the original values into K-V bytes. On completion, Kylin gets an intermediate encoded RDD. In "Stage 6", the intermediate RDD is aggregated with a "reduceByKey" operation to get RDD-1, which is the base cuboid. Next, a "flatMap" (one-to-many map) is done on RDD-1, because the base cuboid has N child cuboids. And so on, all levels' RDDs get calculated. These RDDs are persisted to the distributed file system on completion, but kept cached in memory for the next level's calculation. Once the children have been generated, the parent is removed from the cache.

DAG of Spark Cubing

We did a test to see how much performance improvement can gain from Spark:

Environment

  • 4 nodes Hadoop cluster; each node has 28 GB RAM and 12 cores;
  • YARN has 48GB RAM and 30 cores in total;
  • CDH 5.8, Apache Kylin 2.0 beta.

Spark

  • Spark 1.6.3 on YARN
  • 6 executors, each has 4 cores, 4GB +1GB (overhead) memory

Test Data

  • Airline data, total 160 million rows
  • Cube: 10 dimensions, 5 measures (SUM)

Test Scenarios

  • Build the cube at different source data scales: 3 million, 50 million and 160 million source rows; compare the build time between MapReduce (by layer) and Spark. No compression enabled.
    The time only covers the cube building step, not including data preparation and subsequent steps.

Spark vs MR performance

Spark is faster than MR in all 3 scenarios, and overall it can cut the cubing time roughly in half.

Now you can download a 2.0.0 beta build from Kylin's download page and follow this post to build a cube with the Spark engine. If you have any comments or input, please discuss in the community.


Apache Kylin v2.0.0 Beta Released


The Apache Kylin community is very pleased to announce that the v2.0.0 beta package is now available for download and testing.

More than two months have passed since the v1.6.0 release. During this time, the whole community has worked together to complete a series of major features, hoping to take Apache Kylin to a new level.

You are very welcome to download and test v2.0.0 beta. Your feedback is very important to us; please send mail to dev@kylin.apache.org


Install

For now, v2.0.0 beta cannot be upgraded from v1.6.0 directly; a fresh install is required. This is because the new version's metadata is not backward compatible. Fortunately, the Cube data is backward compatible, so only a metadata conversion tool needs to be developed to enable a smooth upgrade in the near future. We are working on this.


Run the TPC-H Benchmark

Detailed steps to run TPC-H on Apache Kylin: https://github.com/Kyligence/kylin-tpch


Spark Build Engine

Apache Kylin v2.0.0 introduces a brand-new build engine based on Apache Spark. It can replace the original MapReduce build engine. Initial tests show that Cube build time can generally be reduced to about 50% of before.

To enable the Spark build engine:

  • Carefully review the Spark Engine Configs section in kylin.properties.
    • Make sure the specified HADOOP_CONF_DIR contains the site xmls for core, yarn, hive, and hbase.
    • Adjust the Spark executor instances, cores, and memory to your cluster's configuration.
    • Hive on Tez ran into problems in our tests; switching to Hive on MR works around them.
  • When creating a new Cube, just select the Spark build engine in the "Advanced" settings.

Great thanks to everyone who participated and contributed!

Apache Kylin v2.0.0 Beta Announcement


The Apache Kylin community is pleased to announce the v2.0.0 beta package is ready for download and test.

It has been more than 2 months since the v1.6.0 release. The community has been working hard to deliver some long-wanted features, hoping to move Apache Kylin to the next level.

You are very welcome to give the v2.0.0 beta a try, and please do send feedback to dev@kylin.apache.org.


Install

The v2.0.0 beta requires a fresh install at the moment; it cannot be upgraded from v1.6.0 due to incompatible metadata. However, the underlying cube data is backward compatible. We are working on an upgrade tool to transform the metadata, so that a smooth upgrade will be possible.


Run TPC-H Benchmark

Steps to run TPC-H benchmark on Apache Kylin can be found here: https://github.com/Kyligence/kylin-tpch


Spark Cubing Engine

Apache Kylin v2.0.0 introduces a new cubing engine based on Apache Spark that can be selected to replace the original MR engine. Initial tests showed that the Spark engine could cut the build time to 50% in most cases.

To enable the Spark cubing engine:

  • Go through the “Spark Engine Configs” settings in kylin.properties carefully.
    • Make sure the HADOOP_CONF_DIR contains site xmls of core, yarn, hive, and hbase.
    • Adjust the numbers of spark executor instances, cores, and memory according to your environment.
    • Hive on Tez somehow did not work in our tests. Switching to Hive on MR worked.
  • When creating a new cube, select the spark engine in the advanced settings tab. And that is it!

Great thanks to everyone who contributed!

A new measure for Percentile precalculation


Introduction

Since Apache Kylin 2.0, there is a new measure for percentile precalculation, which aims at (sub-)second latency for approximate percentile analytics in SQL queries. The implementation is based on the t-digest library under the Apache 2.0 license, which provides a highly efficient data structure to store aggregation counters and an algorithm to calculate the approximate percentile result.

Percentile

From wikipedia: A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.

In Apache Kylin, we support similar SQL syntax to Apache Hive, with an aggregation function called percentile(<Number Column>, <Double>):

SELECT seller_id, percentile(price, 0.5) FROM test_kylin_fact GROUP BY seller_id

How to use

If you know little about Cubes, please go to the QuickStart first to learn the basics.

Firstly, you need to add this column as measure in data model.

Secondly, create a cube and add a PERCENTILE measure.

Finally, build the cube and try some queries.
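As a sanity check on what percentile(price, 0.5) computes per group, here is an exact Python version using linear interpolation; Kylin's measure uses t-digest, so its results are approximate, and the sample rows below are made up:

```python
from collections import defaultdict

def percentile(values, p):
    """Exact percentile by linear interpolation (t-digest approximates this)."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    rank = p * (len(xs) - 1)
    lo, frac = int(rank), rank - int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * frac

# toy (seller_id, price) rows standing in for test_kylin_fact
rows = [("s1", 10.0), ("s1", 20.0), ("s1", 40.0), ("s2", 5.0)]
by_seller = defaultdict(list)
for seller_id, price in rows:
    by_seller[seller_id].append(price)

# equivalent of: SELECT seller_id, percentile(price, 0.5) ... GROUP BY seller_id
medians = {s: percentile(v, 0.5) for s, v in by_seller.items()}
# s1's median is 20.0, s2's is 5.0
```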

Improving Spark Cubing in Kylin 2.0


Apache Kylin is an OLAP engine that speeds up queries via Cube precomputation. A Cube is a multi-dimensional dataset that contains all measures precomputed for all dimension combinations. Before v2.0, Kylin used MapReduce to build Cubes. To get better performance, Kylin 2.0 introduced Spark Cubing. For the principle behind Spark Cubing, please refer to the article By-layer Spark Cubing.

In this blog, I will talk about the following topics:

  • How to make Spark Cubing support HBase cluster with Kerberos enabled
  • Spark configurations for Cubing
  • Performance of Spark Cubing
  • Pros and cons of Spark Cubing
  • Applicable scenarios of Spark Cubing
  • Improvement for dictionary loading in Spark Cubing

The current Spark Cubing (2.0) version doesn't support HBase clusters with Kerberos enabled, because Spark Cubing needs to get metadata from HBase. To solve this problem, there are two options: one is to make Spark connect to HBase with Kerberos, the other is to avoid Spark connecting to HBase during Spark Cubing.

Make Spark connect HBase with Kerberos enabled

If we just want to run Spark Cubing in YARN client mode, we only need to add a few lines of code before new SparkConf() in SparkCubingByLayer:

Configuration configuration = HBaseConnection.getCurrentHBaseConfiguration();
HConnection connection = HConnectionManager.createConnection(configuration);
// Obtain an authentication token for the given user and add it to the user's credentials.
TokenUtil.obtainAndCacheToken(connection, UserProvider.instantiate(configuration).create(UserGroupInformation.getCurrentUser()));

As for how to make Spark connect to HBase using Kerberos in YARN cluster mode, please refer to SPARK-6918, SPARK-12279, and HBASE-17040. The solution may work, but it is not elegant, so I tried the second option.

Use HDFS metastore for Spark Cubing

The core idea here is to upload the metadata necessary for the job to HDFS and use HDFSResourceStore to manage it.

Before introducing how to use HDFSResourceStore instead of HBaseResourceStore in Spark Cubing, let's see what Kylin's metadata format is and how Kylin manages the metadata.

Every concrete piece of metadata for a table, cube, model, or project is a JSON file in Kylin. The whole metadata set is organized as a file directory. The picture below shows the root directory of Kylin metadata:
(screenshot: the root directory of Kylin metadata)
The following picture shows the content of the project dir; "learn_kylin" and "kylin_test" are both project names.
(screenshot: content of the project directory)

Kylin manages the metadata through ResourceStore, an abstract class that defines the CRUD interface for metadata. ResourceStore has three implementation classes:

  • FileResourceStore (stores on the local file system)
  • HDFSResourceStore
  • HBaseResourceStore

Currently, only HBaseResourceStore can be used in a production environment. FileResourceStore is mainly used for testing. HDFSResourceStore doesn't support massive concurrent writes, but it is ideal for read-only scenarios like cubing. Kylin uses the "kylin.metadata.url" config to decide which kind of ResourceStore will be used.
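That selection can be pictured with a toy Python dispatch; the "@hbase"/"@hdfs" suffix convention and the logic below are simplified assumptions for illustration, not Kylin's actual Java implementation:

```python
def choose_resource_store(metadata_url):
    """Toy dispatch mirroring Kylin's three ResourceStore implementations.
    The URL suffix convention here is a simplification for illustration."""
    if metadata_url.endswith("@hbase"):
        return "HBaseResourceStore"   # the production metastore
    if metadata_url.endswith("@hdfs"):
        return "HDFSResourceStore"    # read-only scenarios such as Spark Cubing
    return "FileResourceStore"        # local file system, mainly for testing

store = choose_resource_store("kylin_metadata@hbase")
```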

Now, let's see how to use HDFSResourceStore instead of HBaseResourceStore in Spark Cubing:

  1. Determine the metadata necessary for the Spark Cubing job
  2. Dump the necessary metadata from HBase to the local file system
  3. Update kylin.metadata.url, then write all Kylin config to a "kylin.properties" file in the local metadata dir
  4. Use ResourceTool to upload the local metadata to HDFS
  5. Construct the HDFSResourceStore from the HDFS "kylin.properties" file in the Spark executor

Of course, we need to delete the HDFS metadata dir on completion. I'm working on a patch for this; please watch KYLIN-2653 for updates.

Spark configurations for Cubing

The following is the Spark configuration I used in our environment. It enables Spark dynamic resource allocation; the goal is to let our users set fewer Spark configurations.

# running in yarn-cluster mode
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster

# enable dynamic allocation so users don't have to set the number of executors explicitly
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=10
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1024
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.shuffle.service.port=7337

# memory config; enlarge executor.memory when the cube dictionary is huge,
# because Kylin needs to load the cube dictionary in the executor
kylin.engine.spark-conf.spark.driver.memory=4G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.cores=1

# enlarge the timeout
kylin.engine.spark-conf.spark.network.timeout=600

kylin.engine.spark-conf.spark.yarn.queue=root.hadoop.test

kylin.engine.spark.rdd-partition-cut-mb=100

Performance test of Spark Cubing

For source data scales from millions to hundreds of millions of rows, my test results are consistent with the blog By-layer Spark Cubing. The improvement is remarkable. Moreover, I also specifically tested with billions of rows of source data and a huge dictionary.

The test Cube1 has 2.7 billion rows of source data, 9 dimensions, and one precise distinct count measure with 70 million cardinality (which means the dict also has 70 million cardinality).

The test Cube2 has 2.4 billion rows of source data, 13 dimensions, and 38 measures (including 9 precise distinct count measures).

The test result is shown in the chart below; the unit of time is minutes.
(chart: build time comparison between Spark and MR cubing)

In one word, Spark Cubing is much faster than MR cubing in most scenarios.

Pros and Cons of Spark Cubing

In my opinion, the advantages of Spark Cubing include:

  1. Because of the RDD cache, Spark Cubing could take full advantage of memory to avoid disk I/O.
  2. When we have enough memory resource, Spark Cubing could use more memory resource to get better build performance.

On the other hand, the drawbacks of Spark Cubing include:

  1. Spark Cubing can't handle huge dictionaries (hundreds of millions of cardinality) well;
  2. Spark Cubing isn’t stable enough for very large scale data.

Applicable scenarios of Spark Cubing

In my opinion, except for the huge dictionary scenario, we could use Spark Cubing to replace MR cubing, especially in the following scenarios:

  1. Many dimensions
  2. Normal dictionaries (e.g., cardinality < 100 million)
  3. Normal scale data (e.g., less than 10 billion rows to build at once).

Improvement for dictionary loading in Spark Cubing

As we all know, a big difference between MR and Spark is that an MR task runs in its own process, while Spark tasks run as threads. So in MR cubing, the cube dictionary is loaded only once per task, but in Spark Cubing the dictionary would be loaded many times in one executor, which causes frequent GC.

So, I made two improvements:

  1. Only load the dict once in one executor.
  2. Add a maximumSize to the LoadingCache in AppendTrieDictionary so the dictionary can be evicted as early as possible.

These two improvements have been contributed to the Kylin repository.
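The second improvement can be pictured as a size-bounded LRU cache; this Python sketch only mirrors the idea (Kylin's actual code uses Guava's LoadingCache with maximumSize), and the loader and names here are hypothetical:

```python
from collections import OrderedDict

class BoundedDictCache:
    """Size-bounded LRU cache: evicts the least-recently-used dictionary
    once maximum_size is exceeded, like Guava's LoadingCache(maximumSize)."""
    def __init__(self, loader, maximum_size):
        self.loader = loader
        self.maximum_size = maximum_size
        self.cache = OrderedDict()
        self.loads = 0   # how many times the loader actually ran

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        self.loads += 1
        value = self.loader(key)
        self.cache[key] = value
        if len(self.cache) > self.maximum_size:
            self.cache.popitem(last=False)  # evict the LRU entry early
        return value

cache = BoundedDictCache(loader=lambda k: f"dict-for-{k}", maximum_size=2)
cache.get("colA"); cache.get("colA")  # second call hits the cache: only one load
cache.get("colB"); cache.get("colC")  # colA is evicted once the bound is exceeded
```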

Summary

Spark Cubing is a great feature for Kylin 2.0; thanks to the Kylin community. We will apply Spark Cubing to real scenarios in our company. I believe Spark Cubing will become more robust and efficient in future releases.

Get Your Interactive Analytics Superpower, with Apache Kylin and Apache Superset


Challenge of Big Data

In the big data era, all enterprises face the growing demand and challenge of processing large volumes of data, workloads that traditional legacy systems can no longer satisfy. With the emergence of Artificial Intelligence (AI) and Internet-of-Things (IoT) technology, it has become mission-critical for businesses to accelerate their pace of discovering valuable insights from their massive and ever-growing datasets. Thus, large companies are constantly searching for a solution, often turning to open source technologies. We will introduce two open source technologies that, when combined, can meet these pressing big data demands for large enterprises.

Apache Kylin: a Leading Open Source OLAP-on-Hadoop Engine

Modern organizations have had a long history of applying Online Analytical Processing (OLAP) technology to analyze data and uncover business insights. These insights help businesses make informed decisions and improve their service and product. With the emergence of the Hadoop ecosystem, OLAP has also embraced new technologies in the big data era.

Apache Kylin is one such technology that directly addresses the challenge of conducting analytical workloads on massive datasets. It is already widely adopted by enterprises around the world. With powerful pre-calculation technology, Apache Kylin enables sub-second query latency over petabyte-scale datasets. The innovative and intricate design of Apache Kylin allows it to seamlessly consume data from any Hadoop-based data source, as well as other relational database management systems (RDBMS). Analysts can query Apache Kylin with standard SQL through ODBC, JDBC, and a RESTful API, which enables the platform to integrate with any third-party application.

Figure 1: Apache Kylin Architecture

In a fast-paced and rapidly-changing business environment, business users and analysts are expected to uncover insights at the speed of thought. They can meet this expectation with Apache Kylin, and are no longer subjected to the predicament of waiting hours for a single query to return results. Such a powerful data processing engine empowers the data scientists, engineers, and business analysts of any enterprise to find insights that help reach critical business decisions. However, business decisions cannot be made without rich data visualization. To address this last-mile challenge of big data analytics, Apache Superset comes into the picture.

Apache Superset: Modern, Enterprise-ready Business Intelligence Platform

Apache Superset is a data exploration and visualization platform designed to be visual, intuitive, and interactive. A user can access data in the following two ways:

  1. Access data from the following commonly used data sources one table at a time: Kylin, Presto, Hive, Impala, SparkSQL, MySQL, Postgres, Oracle, Redshift, SQL Server, Druid.

  2. Use a rich SQL Interactive Development Environment (IDE) called SQL Lab that is designed for power users with the ability to write SQL queries to analyze multiple tables.

Users can immediately analyze and visualize their query results using Apache Superset's rich visualization and reporting features.


Figure 2


Figure 3: Apache Superset Visualization Interface

Integrating Apache Kylin and Apache Superset to Boost Your Productivity

Both Apache Kylin and Apache Superset are built to provide fast and interactive analytics for their users. The combination of these two open source projects can bring that goal to reality on petabyte-scale datasets, thanks to pre-calculated Kylin Cubes.

The Kyligence Data Science team has recently open sourced kylinpy, a project that makes this combination possible. Kylinpy is a Python-based Apache Kylin client library. Any application that uses SQLAlchemy can now query Apache Kylin with this library installed, specifically Apache Superset. Below is a brief tutorial that shows how to integrate Apache Kylin and Apache Superset.

Prerequisite

  1. Install Apache Kylin
    Please refer to this installation tutorial.
  2. Apache Kylin provides a script for you to create a sample Cube. After you successfully installed Apache Kylin, you can run the below script under Apache Kylin installation directory to generate sample project and Cube.
    ./${KYLIN_HOME}/bin/sample.sh
  3. When the script finishes running, log onto Apache Kylin web with default user ADMIN/KYLIN; in the system page click “Reload Metadata,” then you will see a sample project called “Learn Kylin.”

  4. Select the sample cube “kylin_sales_cube”, click “Actions” -> “Build”, pick a date later than 2014-01-01 (to cover all 10000 sample records);


Figure 4: Build Cube in Apache Kylin

  5. Check the build progress in the “Monitor” tab until it reaches 100%;
  6. Execute SQL in the “Insight” tab, for example:
  select part_dt,
         sum(price) as total_selled,
         count(distinct seller_id) as sellers
  from kylin_sales
  group by part_dt
  order by part_dt
-- This query will hit the newly built Cube “kylin_sales_cube”.
  7. Next, we will install Apache Superset and initialize it.
    You may refer to the Apache Superset official website instructions to install and initialize it.
  8. Install kylinpy:
   $ pip install kylinpy
  9. Verify your installation; if everything goes well, the Apache Superset daemon should be up and running.
$ superset runserver -d
Starting server with command:
gunicorn -w 2 --timeout 60 -b  0.0.0.0:8088 --limit-request-line 0 --limit-request-field_size 0 superset:app

[2018-01-03 15:54:03 +0800] [73673] [INFO] Starting gunicorn 19.7.1
[2018-01-03 15:54:03 +0800] [73673] [INFO] Listening at: http://0.0.0.0:8088 (73673)
[2018-01-03 15:54:03 +0800] [73673] [INFO] Using worker: sync
[2018-01-03 15:54:03 +0800] [73676] [INFO] Booting worker with pid: 73676
[2018-01-03 15:54:03 +0800] [73679] [INFO] Booting worker with pid: 73679

Connect Apache Kylin from ApacheSuperset

Now everything you need is installed and ready to go. Let’s try to create an Apache Kylin data source in Apache Superset.
1. Open up http://localhost:8088 in your web browser with the credential you set during Apache Superset installation.

Figure 5: Apache Superset Login Page

  2. Go to Source -> Datasource to configure a new data source.
    • The SQLAlchemy URI pattern is: kylin://<username>:<password>@<hostname>:<port>/<project_name>
    • Check “Expose in SQL Lab” if you want to expose this data source in SQL Lab.
    • Click “Test Connection” to see if the URI is working properly.
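With placeholder values filled in, a kylin:// URI can be sanity-checked with the Python standard library before entering it in Superset; the credentials, host, port, and project below are assumptions for a local sandbox:

```python
from urllib.parse import urlsplit

# hypothetical values for a local sandbox; substitute your own
uri = "kylin://ADMIN:KYLIN@localhost:7070/learn_kylin"

parts = urlsplit(uri)
assert parts.scheme == "kylin"
assert parts.username == "ADMIN" and parts.password == "KYLIN"
assert parts.hostname == "localhost" and parts.port == 7070
project = parts.path.lstrip("/")  # the Kylin project name, e.g. "learn_kylin"
```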


Figure 6: Create an Apache Kylin data source


Figure 7: Test Connection to Apache Kylin

If the connection to Apache Kylin is successful, you will see all the tables from Learn_kylin project show up at the bottom of the connection page.


Figure 8: Tables will show up if connection is successful

Query Kylin Table

  1. Go to Source -> Tables to add a new table, type in a table name from “Learn_kylin” project, for example, “Kylin_sales”.


Figure 9 Add Kylin Table in Apache Superset

  2. Click on the table you created. Now you are ready to analyze your data from Apache Kylin.


Figure 10 Query single table from Apache Kylin

Query Multiple Tables from Kylin Using SQL Lab

A Kylin Cube is usually based on a data model joining multiple tables. Thus, it is quite common to query multiple tables at the same time using Apache Kylin. In Apache Superset, you can use SQL Lab to join your data across tables by composing SQL queries. We will use a query that can hit the sample cube “kylin_sales_cube” as an example.
When you run your query in SQL Lab, the result will come from the data source, in this case, Apache Kylin.


Figure 11 Query multiple tables from Apache Kylin using SQL Lab

When the query returns results, you may immediately visualize them by clicking on the “Visualize” button.

Figure 12 Define your query and visualize it immediately

You may copy the entire SQL below to experience how you can query Kylin Cube in SQL Lab.
select YEAR_BEG_DT, MONTH_BEG_DT, WEEK_BEG_DT,
       META_CATEG_NAME, CATEG_LVL2_NAME, CATEG_LVL3_NAME,
       OPS_REGION, NAME as BUYER_COUNTRY_NAME,
       sum(PRICE) as GMV, sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL, count(*) as CNT
from KYLIN_SALES
join KYLIN_CAL_DT on CAL_DT = PART_DT
join KYLIN_CATEGORY_GROUPINGS on SITE_ID = LSTG_SITE_ID
  and KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID = KYLIN_SALES.LEAF_CATEG_ID
join KYLIN_ACCOUNT on ACCOUNT_ID = BUYER_ID
join KYLIN_COUNTRY on ACCOUNT_COUNTRY = COUNTRY
group by YEAR_BEG_DT, MONTH_BEG_DT, WEEK_BEG_DT,
         META_CATEG_NAME, CATEG_LVL2_NAME, CATEG_LVL3_NAME,
         OPS_REGION, NAME

Experience All Features in Apache Superset with Apache Kylin

Most of the common reporting features are available in Apache Superset. Now let’s see how we can use those features to analyze data from Apache Kylin.

Sorting

You may sort by a measure regardless of how it is visualized.

You may specify a “Sort By” measure or sort the measure on the visualization after the query returns.


Figure 13 Sort by

Filtering

There are multiple ways you may filter data from Apache Kylin.
1. Date Filter
You may filter date and time dimension with the calendar filter.

Figure 14 Filtering time

  2. Dimension Filter
    For other dimensions, you may filter with SQL conditions such as “in, not in, equal to, not equal to, greater than and equal to, smaller than and equal to, greater than, smaller than, like”.

    Figure 15 Filtering dimension

  3. Search Box
    In some visualizations, it is also possible to further narrow down your result set after the query returns from the data source using the “Search Box”.

    Figure 16 Search Box

  4. Filtering the measure
    Apache Superset allows you to write a “having” clause to filter the measure.

    Figure 17 Filtering measure

  5. Filter Box
    The filter box visualization allows you to create a drop-down style filter that dynamically filters all slices on a dashboard.
    As the screenshot below shows, if you filter the CATE_LVL2_NAME dimension from the filter box, all the visualizations on this dashboard will be filtered based on your selection.

    Figure 18 The filter box visualization

Top-N

To provide better query performance for Top N queries, Apache Kylin provides an approximate Top N measure to pre-calculate the top records. In Apache Superset, you may use both the “Sort By” and “Row Limit” features to make sure your query can utilize the Top N pre-calculation in the Kylin Cube.

Figure 19 use both “Sort By” and “Row Limit” to get Top 10

Page Length

Apache Kylin users often need to deal with high cardinality dimensions. When displaying a high cardinality dimension, the visualization shows too many distinct values and takes a long time to render. In that case, Apache Superset's page length feature limits the number of rows per page, so the up-front rendering effort can be reduced.

Figure 20 Limit page length

Visualizations

Apache Superset provides a rich and extensive set of visualizations, from basic charts like pie, bar, and line charts to advanced visualizations like sunburst, heatmap, world map, and Sankey diagrams.

Figure 21


Figure 22


Figure 23 World map visualization


Figure 24 bubble chart

Other functionalities

Apache Superset also supports exporting to CSV, sharing, and viewing the underlying SQL query.

Summary

With the right technical synergy of open source projects, you can achieve amazing results, more than the sum of its parts. The pre-calculation technology of Apache Kylin accelerates visualization performance. The rich functionality of Apache Superset enables all Kylin Cube features to be fully utilized. When you marry the two, you get the superpower of accelerated interactive analytics.

References

  1. Apache Kylin
  2. kylinpy on Github
  3. Superset: Airbnb's data exploration platform
  4. Apache Superset on Github