
Kylin: Interpreting the kylin.properties Configuration File

Author: 李小李的路 | Published 2020-02-05 12:36
    • Based on apache-kylin-3.0

    The kylin.properties configuration file

    • The kylin.properties file contains a large number of settings; this post walks through them one by one.
    • The configuration is organized into the following 15 sections:
      • METADATA | ENV (metadata and Kylin deployment mode)
      • SERVER | WEB | RESTCLIENT (Kylin web service settings)
      • PUBLIC CONFIG (basic web UI settings; rarely need changing)
      • SOURCE (data source settings)
      • STORAGE (data storage settings)
      • JOB (Kylin cube build job settings)
      • ENGINE (build engine settings)
      • CUBE | DICTIONARY (cube and dictionary settings)
      • QUERY (query settings)
      • SECURITY (login access control for the Kylin web UI)
      • SPARK ENGINE CONFIGS (Spark engine settings)
      • QUERY PUSH DOWN (query push-down engine settings)
      • JDBC Data Source (JDBC data source settings)
      • Livy with Kylin (submitting Spark jobs through Livy)
      • Realtime OLAP (streaming OLAP settings)

    The sections of kylin.properties

    • When deploying Kylin, the vast majority of parameters can keep the defaults shipped in the file.
    • When a setting does need to change, edit it in place within its section; that makes it easy to find later and to compare against the default.

    METADATA | ENV

    • Kylin metadata settings; keeping the defaults is recommended
    #
    #### METADATA | ENV ###
    #
    ## The metadata store in hbase  
    The name of the HBase table that stores Kylin's metadata
    #kylin.metadata.url=kylin_metadata@hbase
    #
    ## metadata cache sync retry times
    #kylin.metadata.sync-retries=3
    #
    ## Working folder in HDFS, better be qualified absolute path, make sure user has the right permission to this directory  
    Kylin's working directory on HDFS; by default, metadata entries larger than 10 MB are stored on HDFS. This path is linked to the spark-history path in the SPARK ENGINE CONFIGS section below, so keeping the default is recommended.
    #kylin.env.hdfs-working-dir=/kylin
    #
    ## DEV|QA|PROD. DEV will turn on some dev features, QA and PROD has no difference in terms of functions.
    Kylin's runtime environment: a test deployment can try DEV, while production can keep the QA default
    #kylin.env=QA
    #
    ## kylin zk base path
    Kylin's base path in ZooKeeper. The information Kylin keeps there includes distributed job scheduling state and [dict, job_engine, create_htable] entries; the Kylin code reads metadata from multiple threads and uses ZooKeeper-based locks to coordinate them.
    #kylin.env.zookeeper-base-path=/kylin
    #
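    As a hedged sketch, the one change occasionally made here is giving each Kylin deployment its own metadata table (the table name below is illustrative; the HBase table is created automatically on first start):
    kylin.metadata.url=kylin_prod_metadata@hbase
    #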
    

    SERVER | WEB | RESTCLIENT

    • Kylin server mode, web service, and REST client settings; adjust this section to your actual deployment (a cluster sketch follows at the end of this section)
    #### SERVER | WEB | RESTCLIENT ###
    #
    ## Kylin server mode, valid value [all, query, job]
    Kylin's server mode: a single-node deployment uses all; in a cluster, set each node's mode according to its role
    #kylin.server.mode=all
    #
    ## List of web servers in use, this enables one web server instance to sync up with other servers.
    Mainly for cluster mode; list the servers separated by commas
    #kylin.server.cluster-servers=localhost:7070
    #
    ## Display timezone on UI,format like[GMT+N or GMT-N]
    The timezone displayed in the Kylin web UI; for China, set GMT+8
    #kylin.web.timezone=
    #
    ## Timeout value for the queries submitted through the Web UI, in milliseconds
    #kylin.web.query-timeout=300000
    #
    #kylin.web.cross-domain-enabled=true
    #
    ##allow user to export query result
    #kylin.web.export-allow-admin=true
    #kylin.web.export-allow-other=true
    #
    ## Hide measures in measure list of cube designer, separate by comma
    #kylin.web.hide-measures=RAW
    #
    ##max connections of one route
    #kylin.restclient.connection.default-max-per-route=20
    #
    ##max connections of one rest-client
    #kylin.restclient.connection.max-total=200
    #
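    For example, a hypothetical three-node cluster (hostnames are placeholders) could dedicate one node to builds and two to queries; every node lists all instances so they can sync with each other:
    # on the build node
    kylin.server.mode=job
    # on the two query nodes
    kylin.server.mode=query
    # on every node; placeholder hostnames
    kylin.server.cluster-servers=kylin-job-1:7070,kylin-query-1:7070,kylin-query-2:7070
    kylin.web.timezone=GMT+8
    #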
    

    PUBLIC CONFIG

    • Shared Kylin settings; changes are not recommended, the defaults work well
    #### PUBLIC CONFIG ###
    The default build engine is MapReduce (engine type 2)
    #kylin.engine.default=2
    The default storage is HBase (storage type 2)
    #kylin.storage.default=2
    #kylin.web.hive-limit=20
    #kylin.web.help.length=4
    #kylin.web.help.0=start|Getting Started|http://kylin.apache.org/docs/tutorial/kylin_sample.html
    #kylin.web.help.1=odbc|ODBC Driver|http://kylin.apache.org/docs/tutorial/odbc.html
    #kylin.web.help.2=tableau|Tableau Guide|http://kylin.apache.org/docs/tutorial/tableau_91.html
    #kylin.web.help.3=onboard|Cube Design Tutorial|http://kylin.apache.org/docs/howto/howto_optimize_cubes.html
    #kylin.web.link-streaming-guide=http://kylin.apache.org/
    #kylin.htrace.show-gui-trace-toggle=false
    #kylin.web.link-hadoop=
    #kylin.web.link-diagnostic=
    #kylin.web.contact-mail=
    #kylin.server.external-acl-provider=
    #
    ## Default time filter for job list, 0->current day, 1->last one day, 2->last one week, 3->last one year, 4->all
    The default time range shown on pages such as the Kylin Monitor job list
    #kylin.web.default-time-filter=1
    #
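    As a sketch, switching the default build engine to Spark means changing the engine type id; in Kylin 3.x, 2 is MapReduce and 4 is Spark (verify against your version before relying on this):
    # assumes the Spark engine in SPARK ENGINE CONFIGS below is properly set up
    kylin.engine.default=4
    #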
    

    SOURCE

    • Kylin's data source settings; here the source is Hive
    #### SOURCE ###
    #
    ## Hive client, valid value [cli, beeline]
    The default is the Hive cli client
    #kylin.source.hive.client=cli
    #
    ## Absolute path to beeline shell, can be set to spark beeline instead of the default hive beeline on PATH
    #kylin.source.hive.beeline-shell=beeline
    #
    ## Parameters for beeline client, only necessary if hive client is beeline
    ##kylin.source.hive.beeline-params=-n root --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='mapreduce.job.*|dfs.*' -u jdbc:hive2://localhost:10000
    #
    ## While hive client uses above settings to read hive table metadata,
    ## table operations can go through a separate SparkSQL command line, given SparkSQL connects to the same Hive metastore.
    #kylin.source.hive.enable-sparksql-for-table-ops=false
    ##kylin.source.hive.sparksql-beeline-shell=/path/to/spark-client/bin/beeline
    ##kylin.source.hive.sparksql-beeline-params=-n root --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='mapreduce.job.*|dfs.*' -u jdbc:hive2://localhost:10000
    #
    Whether to keep the flat table; by default it is cleaned up after the Kylin job finishes
    #kylin.source.hive.keep-flat-table=false
    #
    ## Hive database name for putting the intermediate flat tables
    The intermediate tables created during Kylin jobs live in Hive's default database; keeping the default is recommended
    #kylin.source.hive.database-for-flat-table=default
    # 
    ## Whether redistribute the intermediate flat table before building
    #kylin.source.hive.redistribute-flat-table=true
    #
    #
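    A minimal sketch of switching the Hive client to beeline; the JDBC URL and user are placeholders for your HiveServer2:
    kylin.source.hive.client=beeline
    kylin.source.hive.beeline-shell=beeline
    # placeholder endpoint; add the confwhitelist flags shown above if your cluster needs them
    kylin.source.hive.beeline-params=-n hive -u jdbc:hive2://hiveserver2-host:10000
    #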
    

    STORAGE

    • Kylin storage settings; adjust them to your actual environment
    #### STORAGE ###
    #
    ## The storage for final cube file in hbase
    #kylin.storage.url=hbase
    #
    ## The prefix of hbase table
    Prefix of the HBase table names backing Kylin segments
    #kylin.storage.hbase.table-name-prefix=KYLIN_
    #
    ## The namespace for hbase storage
    The HBase namespace in which Kylin's storage tables are created
    #kylin.storage.hbase.namespace=default
    #
    ## Compression codec for htable, valid value [none, snappy, lzo, gzip, lz4]
    Compression codec for the HTables; if you enable one, first verify that the cluster actually supports that codec
    #kylin.storage.hbase.compression-codec=none
    #
    ## HBase Cluster FileSystem, which serving hbase, format as hdfs://hbase-cluster:8020
    ## Leave empty if hbase running on same cluster with hive and mapreduce
    ##kylin.storage.hbase.cluster-fs=
    #
    ## The cut size for hbase region, in GB.
    #kylin.storage.hbase.region-cut-gb=5
    #
    ## The hfile size in GB; a smaller hfile means the hfile-converting MR job gets more reducers and runs faster.
    ## Set 0 to disable this optimization.
    #kylin.storage.hbase.hfile-size-gb=2
    #
    #kylin.storage.hbase.min-region-count=1
    #kylin.storage.hbase.max-region-count=500
    #
    ## Optional information for the owner of kylin platform, it can be your team's email
    ## Currently it will be attached to each kylin's htable attribute
    The owner tag appears in the corresponding HTable's schema
    #kylin.storage.hbase.owner-tag=whoami@kylin.apache.org
    #
    #kylin.storage.hbase.coprocessor-mem-gb=3
    #
    ## By default kylin can spill query's intermediate results to disks when it's consuming too much memory.
    ## Set it to false if you want query to abort immediately in such condition.
    #kylin.storage.partition.aggr-spill-enabled=true
    #
    ## The maximum number of bytes each coprocessor is allowed to scan.
    ## To allow arbitrary large scan, you can set it to 0.
    #kylin.storage.partition.max-scan-bytes=3221225472
    #
    ## The default coprocessor timeout is (hbase.rpc.timeout * 0.9) / 1000 seconds,
    ## You can set it to a smaller value. 0 means use default.
    ## kylin.storage.hbase.coprocessor-timeout-seconds=0
    #
    ## clean real storage after delete operation
    ## if you want to delete the real storage like htable of deleting segment, you can set it to true
    Whether deleting a cube segment also deletes the corresponding HTable. Setting this to true is recommended; otherwise invalid HTables accumulate in HBase.
    #kylin.storage.clean-after-delete-operation=false
    #
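    A hedged sketch of the two changes this section most often receives: compression (only after confirming the codec works on your cluster) and HTable cleanup on deletion:
    # confirm snappy support on the HBase cluster before enabling
    kylin.storage.hbase.compression-codec=snappy
    # drop the backing HTable when a segment/cube is deleted
    kylin.storage.clean-after-delete-operation=true
    #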
    

    JOB

    • Kylin job settings, covering retries, concurrency, e-mail notification, distributed job scheduling, and more; configure them to your actual needs
    #### JOB ###
    #
    ## Max job retry on error, default 0: no retry
    Number of retries when a job step errors out
    #kylin.job.retry=0
    #
    ## Max count of concurrent jobs running
    Maximum number of jobs Kylin runs concurrently
    #kylin.job.max-concurrent-jobs=10
    #
    ## The percentage of the sampling, default 100%
    #kylin.job.sampling-percentage=100
    #
    ## If true, will send email notification on job complete
    E-mail notification settings; enabling these is recommended (see the sketch at the end of this section)
    ##kylin.job.notification-enabled=true
    ##kylin.job.notification-mail-enable-starttls=true
    ##kylin.job.notification-mail-host=smtp.office365.com
    ##kylin.job.notification-mail-port=587
    ##kylin.job.notification-mail-username=kylin@example.com
    ##kylin.job.notification-mail-password=mypassword
    ##kylin.job.notification-mail-sender=kylin@example.com
    Uncomment the provider below when enabling distributed job scheduling
    #kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
    Change the default scheduler id to match (the CuratorScheduler above is registered as 100)
    #kylin.job.scheduler.default=0
    #
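    A sketch combining the two common changes here, with placeholder SMTP values; the scheduler id 100 matches the CuratorScheduler provider registered above:
    kylin.job.notification-enabled=true
    kylin.job.notification-mail-enable-starttls=true
    # placeholder SMTP settings
    kylin.job.notification-mail-host=smtp.mycompany.com
    kylin.job.notification-mail-port=587
    kylin.job.notification-mail-username=kylin@mycompany.com
    kylin.job.notification-mail-sender=kylin@mycompany.com
    # distributed job scheduling via ZooKeeper/Curator
    kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
    kylin.job.scheduler.default=100
    #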
    

    ENGINE

    • Settings for the default MapReduce engine; the defaults can be kept
    #### ENGINE ###
    #
    ## Time interval to check hadoop job status
    Interval at which Kylin polls the status of MapReduce jobs
    #kylin.engine.mr.yarn-check-interval-seconds=10
    #
    #kylin.engine.mr.reduce-input-mb=500
    #
    #kylin.engine.mr.max-reducer-number=500
    #
    #kylin.engine.mr.mapper-input-rows=1000000
    #
    ## Enable dictionary building in MR reducer
    #kylin.engine.mr.build-dict-in-reducer=true
    #
    ## Number of reducers for fetching UHC column distinct values
    #kylin.engine.mr.uhc-reducer-count=3
    #
    ## Whether using an additional step to build UHC dictionary
    #kylin.engine.mr.build-uhc-dict-in-additional-step=false
    #
    #
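    If builds are slow on ultra-high-cardinality (UHC) columns, the two UHC knobs above can be raised; a hedged sketch (values illustrative):
    # more reducers for fetching UHC distinct values
    kylin.engine.mr.uhc-reducer-count=5
    # build UHC dictionaries in a dedicated extra step
    kylin.engine.mr.build-uhc-dict-in-additional-step=true
    #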
    

    CUBE | DICTIONARY

    • Kylin cube and dictionary settings; the defaults can be kept
    #### CUBE | DICTIONARY ###
    #
    #kylin.cube.cuboid-scheduler=org.apache.kylin.cube.cuboid.DefaultCuboidScheduler
    #kylin.cube.segment-advisor=org.apache.kylin.cube.CubeSegmentAdvisor
    #
    ## 'auto', 'inmem', 'layer' or 'random' for testing 
    The cube build algorithm
    #kylin.cube.algorithm=layer
    #
    ## A smaller threshold prefers layer, a larger threshold prefers in-mem
    #kylin.cube.algorithm.layer-or-inmem-threshold=7
    #
    ## auto use inmem algorithm:
    ## 1, cube planner optimize job
    ## 2, no source record
    #kylin.cube.algorithm.inmem-auto-optimize=true
    #
    #kylin.cube.aggrgroup.max-combination=32768
    #
    #kylin.snapshot.max-mb=300
    #
    #kylin.cube.cubeplanner.enabled=true
    #kylin.cube.cubeplanner.enabled-for-existing-cube=true
    #kylin.cube.cubeplanner.expansion-threshold=15.0
    #kylin.cube.cubeplanner.recommend-cache-max-size=200
    #kylin.cube.cubeplanner.mandatory-rollup-threshold=1000
    #kylin.cube.cubeplanner.algorithm-threshold-greedy=8
    #kylin.cube.cubeplanner.algorithm-threshold-genetic=23
    #
    #
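    As a sketch, a memory-rich cluster building small cubes could pin the algorithm instead of letting the threshold above decide:
    # 'auto' lets Kylin choose per job; 'inmem' trades memory for fewer MR rounds
    kylin.cube.algorithm=inmem
    #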
    

    QUERY

    • Query-time settings; keep the defaults unless you are an advanced user
    #### QUERY ###
    #
    ## Controls the maximum number of bytes a query is allowed to scan storage.
    ## The default value 0 means no limit.
    ## The counterpart kylin.storage.partition.max-scan-bytes sets the maximum per coprocessor.
    #kylin.query.max-scan-bytes=0
    #
    #kylin.query.cache-enabled=true
    #
    ## Controls extras properties for Calcite jdbc driver
    ## all extras properties should be under the prefix "kylin.query.calcite.extras-props."
    ## case sensitive, default: true, to enable case insensitive set it to false
    ## @see org.apache.calcite.config.CalciteConnectionProperty.CASE_SENSITIVE
    #kylin.query.calcite.extras-props.caseSensitive=true
    ## how to handle unquoted identifiers, default: TO_UPPER, available options: UNCHANGED, TO_UPPER, TO_LOWER
    ## @see org.apache.calcite.config.CalciteConnectionProperty.UNQUOTED_CASING
    #kylin.query.calcite.extras-props.unquotedCasing=TO_UPPER
    ## quoting method, default: DOUBLE_QUOTE, available options: DOUBLE_QUOTE, BACK_TICK, BRACKET
    ## @see org.apache.calcite.config.CalciteConnectionProperty.QUOTING
    #kylin.query.calcite.extras-props.quoting=DOUBLE_QUOTE
    ## change SqlConformance from DEFAULT to LENIENT to enable group by ordinal
    ## @see org.apache.calcite.sql.validate.SqlConformance.SqlConformanceEnum
    #kylin.query.calcite.extras-props.conformance=LENIENT
    #
    ## TABLE ACL
    #kylin.query.security.table-acl-enabled=true
    #
    ## Usually should not modify this
    #kylin.query.interceptors=org.apache.kylin.rest.security.TableInterceptor
    #
    #kylin.query.escape-default-keyword=false
    #
    ## Usually should not modify this
    #kylin.query.transformers=org.apache.kylin.query.util.DefaultQueryTransformer,org.apache.kylin.query.util.KeywordDefaultDirtyHack
    #
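    For example, to make unquoted identifiers case-insensitive in queries, the Calcite extras above are usually changed together (a sketch):
    kylin.query.calcite.extras-props.caseSensitive=false
    # fold unquoted identifiers to upper case so lookups still match
    kylin.query.calcite.extras-props.unquotedCasing=TO_UPPER
    #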
    

    SECURITY

    • Access control for users logging in to the Kylin web UI; a powerful feature.
    #### SECURITY ###
    #
    ## Spring security profile, options: testing, ldap, saml
    ## with "testing" profile, user can use pre-defined name/pwd like KYLIN/ADMIN to login
    #kylin.security.profile=testing
    #
    ## Admin roles in LDAP, for ldap and saml
    #kylin.security.acl.admin-role=admin
    #
    ## LDAP authentication configuration
    #kylin.security.ldap.connection-server=ldap://ldap_server:389
    #kylin.security.ldap.connection-username=
    #kylin.security.ldap.connection-password=
    #
    ## LDAP user account directory;
    #kylin.security.ldap.user-search-base=
    #kylin.security.ldap.user-search-pattern=
    #kylin.security.ldap.user-group-search-base=
    #kylin.security.ldap.user-group-search-filter=(|(member={0})(memberUid={1}))
    #
    ## LDAP service account directory
    #kylin.security.ldap.service-search-base=
    #kylin.security.ldap.service-search-pattern=
    #kylin.security.ldap.service-group-search-base=
    #
    ### SAML configurations for SSO
    ## SAML IDP metadata file location
    #kylin.security.saml.metadata-file=classpath:sso_metadata.xml
    #kylin.security.saml.metadata-entity-base-url=https://hostname/kylin
    #kylin.security.saml.keystore-file=classpath:samlKeystore.jks
    #kylin.security.saml.context-scheme=https
    #kylin.security.saml.context-server-name=hostname
    #kylin.security.saml.context-server-port=443
    #kylin.security.saml.context-path=/kylin
    #
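    A hedged sketch of an LDAP setup; the server address, bind account, and search bases are all placeholders for your own directory:
    kylin.security.profile=ldap
    kylin.security.acl.admin-role=kylin-admins
    # placeholder directory settings
    kylin.security.ldap.connection-server=ldap://ldap.mycompany.com:389
    kylin.security.ldap.connection-username=cn=kylin,ou=services,dc=mycompany,dc=com
    kylin.security.ldap.connection-password=secret
    kylin.security.ldap.user-search-base=ou=people,dc=mycompany,dc=com
    kylin.security.ldap.user-search-pattern=(&(cn={0}))
    #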
    

    SPARK ENGINE CONFIGS

    • When Spark is the cube build engine, tune this section to match your cluster
    #### SPARK ENGINE CONFIGS ###
    #
    ## Hadoop conf folder, will export this as "HADOOP_CONF_DIR" to run spark-submit
    ## This must contain site xmls of core, yarn, hive, and hbase in one folder
    ##kylin.env.hadoop-conf-dir=/etc/hadoop/conf
    #
    ## Estimate the RDD partition numbers
    #kylin.engine.spark.rdd-partition-cut-mb=10
    #
    ## Minimal partition numbers of rdd
    #kylin.engine.spark.min-partition=1
    #
    ## Max partition numbers of rdd
    #kylin.engine.spark.max-partition=5000
    #
    ## Spark conf (default is in spark/conf/spark-defaults.conf)
    #kylin.engine.spark-conf.spark.master=yarn
    ##kylin.engine.spark-conf.spark.submit.deployMode=cluster
    #kylin.engine.spark-conf.spark.yarn.queue=default
    #kylin.engine.spark-conf.spark.driver.memory=2G
    #kylin.engine.spark-conf.spark.executor.memory=4G
    #kylin.engine.spark-conf.spark.executor.instances=40
    #kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
    #kylin.engine.spark-conf.spark.shuffle.service.enabled=true
    #kylin.engine.spark-conf.spark.eventLog.enabled=true
    #kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
    #kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
    #kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
    #
    #### Spark conf for specific job
    #kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
    #kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
    #
    ## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar at runtime
    ##kylin.engine.spark-conf.spark.yarn.archive=hdfs://namenode:8020/kylin/spark/spark-libs.jar
    ##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
    #
    ## uncomment for HDP
    ##kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
    ##kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
    ##kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
    #
    #
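    One optimization worth highlighting from the comments above: upload a spark-libs.jar to HDFS once so builds stop re-uploading Spark's jars every time. A sketch, assuming the jar was already built from $KYLIN_HOME/spark/jars and pushed to the placeholder path below:
    kylin.engine.spark-conf.spark.yarn.archive=hdfs://namenode:8020/kylin/spark/spark-libs.jar
    kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
    #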
    

    QUERY PUSH DOWN

    • Kylin's query push-down engine; it can be pointed at Impala, Presto, and similar engines
    #### QUERY PUSH DOWN ###
    #
    ##kylin.query.pushdown.runner-class-name=org.apache.kylin.query.adhoc.PushDownRunnerJdbcImpl
    #
    ##kylin.query.pushdown.update-enabled=false
    ##kylin.query.pushdown.jdbc.url=jdbc:hive2://sandbox:10000/default
    ##kylin.query.pushdown.jdbc.driver=org.apache.hive.jdbc.HiveDriver
    ##kylin.query.pushdown.jdbc.username=hive
    ##kylin.query.pushdown.jdbc.password=
    #
    ##kylin.query.pushdown.jdbc.pool-max-total=8
    ##kylin.query.pushdown.jdbc.pool-max-idle=8
    ##kylin.query.pushdown.jdbc.pool-min-idle=0
    #
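    A minimal sketch of enabling push-down to HiveServer2 (host and account are placeholders); queries no cube can answer then fall through to Hive instead of failing:
    kylin.query.pushdown.runner-class-name=org.apache.kylin.query.adhoc.PushDownRunnerJdbcImpl
    # placeholder HiveServer2 endpoint and account
    kylin.query.pushdown.jdbc.url=jdbc:hive2://hiveserver2-host:10000/default
    kylin.query.pushdown.jdbc.driver=org.apache.hive.jdbc.HiveDriver
    kylin.query.pushdown.jdbc.username=hive
    kylin.query.pushdown.jdbc.password=
    #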
    

    JDBC Data Source

    • Configuring a JDBC data source for Kylin
    #### JDBC Data Source
    ##kylin.source.jdbc.connection-url=
    ##kylin.source.jdbc.driver=
    ##kylin.source.jdbc.dialect=
    ##kylin.source.jdbc.user=
    ##kylin.source.jdbc.pass=
    ##kylin.source.jdbc.sqoop-home=
    ##kylin.source.jdbc.filed-delimiter=|
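    A hedged sketch for a MySQL source; the connection details are placeholders, and the MySQL JDBC driver plus Sqoop must be available to Kylin:
    # placeholder MySQL connection
    kylin.source.jdbc.connection-url=jdbc:mysql://mysql-host:3306/sales
    kylin.source.jdbc.driver=com.mysql.jdbc.Driver
    kylin.source.jdbc.dialect=mysql
    kylin.source.jdbc.user=kylin
    kylin.source.jdbc.pass=secret
    # Sqoop pulls the source data to HDFS; path is a placeholder
    kylin.source.jdbc.sqoop-home=/usr/local/sqoop
    kylin.source.jdbc.filed-delimiter=|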
    

    Livy with Kylin

    • Submitting Spark jobs through the Livy plugin
    #### Livy with Kylin
    ##kylin.engine.livy-conf.livy-enabled=false
    ##kylin.engine.livy-conf.livy-url=http://LivyHost:8998
    ##kylin.engine.livy-conf.livy-key.file=hdfs:///path-to-kylin-job-jar
    ##kylin.engine.livy-conf.livy-arr.jars=hdfs:///path-to-hadoop-dependency-jar
    #
    #
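    A sketch of turning Livy submission on; the endpoint and HDFS jar paths are placeholders and must point at real files:
    kylin.engine.livy-conf.livy-enabled=true
    # placeholder Livy endpoint and dependency jars
    kylin.engine.livy-conf.livy-url=http://livy-host:8998
    kylin.engine.livy-conf.livy-key.file=hdfs:///kylin/livy/kylin-job.jar
    kylin.engine.livy-conf.livy-arr.jars=hdfs:///kylin/livy/hadoop-dependencies.jar
    #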
    

    Realtime OLAP

    • Streaming (realtime) OLAP settings; adjust them to your actual needs
    #### Realtime OLAP ###
    #
    ## Where the local segment cache is located; for a relative path, the real path will be ${KYLIN_HOME}/${kylin.stream.index.path}
    #kylin.stream.index.path=stream_index
    #
    ## The timezone for Derived Time Column like hour_start, try set to GMT+N, please check detail at KYLIN-4010
    #kylin.stream.event.timezone=
    #
    ## Debug switch for print realtime global dict encode information, please check detail at KYLIN-4141
    #kylin.stream.print-realtime-dict-enabled=false
    #
    ## Should enable latest coordinator, please check detail at KYLIN-4167
    #kylin.stream.new.coordinator-enabled=true
    #
    ## In which way should we collect receiver's metrics info
    ##kylin.stream.metrics.option=console/csv/jmx
    #
    ## When enabling a streaming cube, whether to consume from the latest or the earliest offset
    #kylin.stream.consume.offsets.latest=true
    #
    ## The parallelism of scan in receiver side
    #kylin.stream.receiver.use-threads-per-query=8
    #
    ## How the coordinator/receiver registers itself in StreamMetadata; there are three options:
    ## 1. hostname:port, then kylin will set the config ip and port as the currentNode;
    ## 2. port, then kylin will get the node's hostname and append port as the currentNode;
    ## 3. not set, then kylin will get the node hostname address and set the hostname and defaultPort(7070 for coordinator or 9090 for receiver) as the currentNode.
    ##kylin.stream.node=
    #
    ## Automatically resubmit a job after it is discarded
    #kylin.stream.auto-resubmit-after-discard-enabled=true
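    A sketch for a receiver node that registers itself explicitly (the hostname is a placeholder; 9090 is the default receiver port per the comment above) and uses China's timezone for derived time columns:
    kylin.stream.node=receiver-host-1:9090
    kylin.stream.event.timezone=GMT+8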
    
