Monday, August 31, 2015

Help 4 Error - Old

  1. bin/flume-ng: line X: syntax error in conditional expression (flume)
    1. add " symbol. for instance,
      1. if [[ $line =~ "^java\.library\.path=(.*)$" ]]; then
    2. or, update bash.
      1. check the bash version with the 'bash --version' command
  2. warning: unprotected private key file! / permissions 0xxx for 'path/to/id_rsa' are too open (OpenSSH)
    1. chmod 600 path/to/id_rsa
  3. asks for a password even though the authorized_keys file exists (SSH)
    1. chmod 700 .ssh
    2. chmod 644 .ssh/authorized_keys
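    3. a combined sketch, assuming the default ~/.ssh layout and key name:
      1. chmod 700 ~/.ssh
        chmod 600 ~/.ssh/id_rsa
        chmod 644 ~/.ssh/authorized_keys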
  4. failed to recv file (R - scp)
    1. check the file path
  5. no lines available in input / 입력에 가능한 라인들이 없습니다 (the Korean-locale message) (R - read.table with pipe)
    1. check the file path
  6. java.net.MalformedURLException: unknown protocol: hdfs (java)
    1. add 'URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());'
  7. argument list too long (curl)
    1. the error occurs because 'cat base64.txt' expands to a command-line argument that is too long
      1. curl -XPOST "http://localhost:9200/test/person" -d '
        {
        "file" : {
        "_content" : "'`cat base64.txt`'"
        }
        }'
    2. use '@-' to solve the problem
      1. curl -XPOST "http://localhost:9200/test/person" -d @- <<CURL_DATA
        {
        "file" : {
        "_content" : "`cat base64.txt`"
        }
        }
        CURL_DATA
      2. note that you can use any string instead of CURL_DATA, and there is no single quote inside the value of _content this time
  8. invalid target release: 1.7 (maven)
    1. export JAVA_HOME=<java home path>
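    2. a minimal sketch, assuming a JDK 1.7 install under /usr/java (the path is only an example):
      1. export JAVA_HOME=/usr/java/jdk1.7.0_79
        export PATH=$JAVA_HOME/bin:$PATH
        mvn -v    # confirm maven now reports the intended java version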

Behemoth

  1. 2015.08.18
    1. prerequisites
      1. java 1.6
      2. apache maven 2.2.1
      3. internet connection
    2. compiling
      1. git clone https://github.com/DigitalPebble/behemoth.git
      2. cd behemoth
      3. mvn install
      4. mvn test
      5. mvn package
    3. generate a corpus
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i <file or dir> -o output1
      2. ./behemoth importer
    4. extract text
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i output1 -o output2
      2. ./behemoth tika
    5. inspect the corpus
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusReader -i output2 -a -c -m -t
      2. hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00000
      3. hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00001
      4. ./behemoth reader
    6. extract content from seq files
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2 -o output3
      2. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00000 -o output4
      3. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00001 -o output5
      4. ./behemoth exporter

ElasticSearch

  1. 2015.08.14
    1. mapper attachments type for elasticsearch
      1. each node
        1. bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.4.3
          1. note that 2.4.3 is for ES 1.4
        2. restart
      2. DELETE /test
      3. PUT /test
      4. PUT /test/person/_mapping
        {
        "person" : {
        "properties" : {
        "file" : {
        "type" : "attachment",
        "fields" : {
        "file" : {"term_vector" : "with_positions_offsets", "store": true},
        "title" : {"store" : "yes"},
        "date" : {"store" : "yes"},
        "author" : {"store" : "yes"},
        "keywords" : {"store" : "yes"},
        "content_type" : {"store" : "yes"},
        "content_length" : {"store" : "yes"},
        "language" : {"store" : "yes"}
        }
        }
        }
        }
        }
      5. curl -XPOST "http://localhost:9200/test/person" -d '
        {
        "file" : {
        "_content" : "... base64 encoded attachment ..."
        }
        }'
      6. for long base64
        1. curl -XPOST "http://localhost:9200/test/person" -d @- <<CURL_DATA
          {
          "file" : {
          "_content" : "`base64 my.pdf | perl -pe 's/\n/\\n/g'`"
          }
          }
          CURL_DATA
      7. GET /test/person/_search
        {
        "fields": [ "file.date", "file.title", "file.name", "file.author", "file.keywords", "file.language", "file.cotent_length", "file.content_type", "file" ],
        "query": {
        "match": {
        "file.content_type": "pdf"
        }
        }
        }
  2. 2015.03.03
    1. bashrc
      1. export INNERIP=`hostname -i`
        export ES_HEAP_SIZE=8g
        export ES_CLASSPATH=/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*
    2. configuration
      1. cluster.name: test
      2. node.name: ${HOSTNAME}
      3. transport.host: ${INNERIP}
      4. discovery.zen.ping.multicast.enabled: false
      5. discovery.zen.ping.unicast.hosts: ["10.0.2.a", "10.0.2.b", "10.0.2.c"]
      6. indices.fielddata.cache.size: 40%
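    3. a combined sketch of how the settings above sit in config/elasticsearch.yml (the quoted heredoc delimiter keeps ${HOSTNAME} and ${INNERIP} unexpanded so elasticsearch resolves them from the environment exported in item 1; the IPs are still placeholders):
      1. cat >> config/elasticsearch.yml <<'EOF'
        cluster.name: test
        node.name: ${HOSTNAME}
        transport.host: ${INNERIP}
        discovery.zen.ping.multicast.enabled: false
        discovery.zen.ping.unicast.hosts: ["10.0.2.a", "10.0.2.b", "10.0.2.c"]
        indices.fielddata.cache.size: 40%
        EOF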
  3. 2015.03.02
    1. snapshot and restore
      1. repository register
        1. PUT _snapshot/hdfs
          {
          "type": "hdfs",
          "settings": {
          "path": "/backup/elasticsearch"
          }
          }
      2. repository verification
        1. POST _snapshot/hdfs/_verify
      3. snapshot
        1. PUT _snapshot/hdfs/20150302
      4. monitoring snapshot/restore progress
        1. GET _snapshot/hdfs/20150302/_status
        2. GET _snapshot/hdfs/20150302
      5. snapshot information and status
        1. GET _snapshot/hdfs/20150302
        2. GET _snapshot/hdfs/_all 
        3. GET _snapshot/_status 
        4. GET _snapshot/hdfs/_status 
        5. GET _snapshot/hdfs/20150302/_status
      6. restore
        1. POST _snapshot/hdfs/20150302/_restore
      7. snapshot deletion / stopping currently running snapshot and restore operations
        1. DELETE _snapshot/hdfs/20150302
      8. repository deletion
        1. DELETE _snapshot/hdfs
      9. reference
        1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
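      10. the same flow as plain curl commands (a sketch; assumes an HDFS snapshot/repository plugin is installed and ES listens on localhost:9200)
        1. curl -XPUT 'localhost:9200/_snapshot/hdfs' -d '{"type":"hdfs","settings":{"path":"/backup/elasticsearch"}}'
          curl -XPOST 'localhost:9200/_snapshot/hdfs/_verify'
          curl -XPUT 'localhost:9200/_snapshot/hdfs/20150302'
          curl 'localhost:9200/_snapshot/hdfs/20150302/_status'
          curl -XPOST 'localhost:9200/_snapshot/hdfs/20150302/_restore'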
    2. rolling update
      1. Disable shard reallocation
        1. curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
      2. Shut down a single node within the cluster
        1. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
      3. Confirm that all shards are correctly reallocated to the remaining running nodes
      4. Download newest version
      5. Extract the zip or tarball to a new directory
      6. Copy the configuration files from the old Elasticsearch installation’s config directory to the new Elasticsearch installation’s config directory
      7. Move data files from the old Elasticsearch installation’s data directory to the new Elasticsearch installation’s data directory
      8. Install plugins
      9. Start the now upgraded node
      10. Confirm that it joins the cluster
      11. Re-enable shard reallocation
        1. curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'
      12. Observe that all shards are properly allocated on all nodes
      13. Repeat this process for all remaining nodes
      14. Reference
        1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-upgrade.html#rolling-upgrades
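      15. a per-node sketch of steps 2-10 (all paths, the tarball URL and version are placeholders; disable/re-enable allocation around it as in steps 1 and 11)
        1. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
          wget <new elasticsearch tarball url>
          tar xzf elasticsearch-<new version>.tar.gz -C /opt
          cp /opt/elasticsearch-<old>/config/* /opt/elasticsearch-<new version>/config/
          mv /opt/elasticsearch-<old>/data /opt/elasticsearch-<new version>/
          /opt/elasticsearch-<new version>/bin/plugin --install <plugin>    # repeat per plugin
          /opt/elasticsearch-<new version>/bin/elasticsearch -d
          curl 'localhost:9200/_cat/nodes'    # confirm the node rejoined the cluster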
  4. 2015.02.13
    1. MySQL Slow Query Log Mapping
      1. PUT msql-2015
        {
          "mappings": {
            "log": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock_time": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "query_time": {
                  "type": "double"
                },
                "rows_examined": {
                  "type": "double"
                },
                "rows_sent": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
    2. MySQL Slow Query Dump Mapping
      1. PUT msqld-2015
        {
          "mappings": {
            "dump": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "count": {
                  "type": "double"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "rows": {
                  "type": "double"
                },
                "time": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
  5. 2015.02.12
    1. MySQL Slow Query Log & Dump Mappings
      1. PUT msqld-2015
        {
          "mappings": {
            "log": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock_time": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "query_time": {
                  "type": "double"
                },
                "rows_examined": {
                  "type": "double"
                },
                "rows_sent": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            },
            "dump": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "count": {
                  "type": "double"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "rows": {
                  "type": "double"
                },
                "time": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
  6. 2015.01.19
    1. restart script
      1. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
        sleep 1s
        curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
        sleep 1s
        bin/elasticsearch -d
        sleep 10s
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
  7. ~ 2015.01.01
    1. Command
      1. curl 'http://localhost:9200/?pretty'
      2. curl -XPOST 'http://localhost:9200/_shutdown'
      3. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
      4. curl -XPOST 'http://localhost:9200/_cluster/nodes/nodeId1,nodeId2/_shutdown'
      5. curl -XPOST 'http://localhost:9200/_cluster/nodes/_master/_shutdown'
    2. Configuration
      1. config/elasticsearch.yml
        1. cluster.name
        2. node.name
        3. node.master
        4. node.data
        5. path.*
          1. path.conf: -Des.path.conf
          2. path.data
          3. path.work
          4. path.logs
        6. discovery.zen.ping.multicast.enabled: false
        7. discovery.zen.ping.unicast.hosts
        8. gateway.recover_after_nodes: n
        9. discovery.zen.minimum_master_nodes: (n/2) + 1
        10. action.disable_delete_all_indices: true
        11. action.auto_create_index: false
        12. action.destructive_requires_name: true
        13. index.mapper.dynamic: false
        14. script.disable_dynamic: true
        15. indices.fielddata.cache.size: 40%
      2. dynamic
        1. discovery.zen.minimum_master_nodes
          curl -XPUT localhost:9200/_cluster/settings -d '{
            "persistent" : {
              "discovery.zen.minimum_master_nodes" : (n/2) + 1
            }
          }'
        2. disable _all
          PUT /my_index/_mapping/my_type
          {
              "my_type": {
                  "_all": { "enabled": false }
              }
          }
        3. include_in_all
          PUT /my_index/my_type/_mapping
          {
              "my_type": {
                  "include_in_all": false,
                  "properties": {
                      "title": {
                          "type":           "string",
                          "include_in_all": true
                      },
                      ...
                  }
              }
          }
        4. _alias, _aliases
          PUT /my_index_v1 
          PUT /my_index_v1/_alias/my_index

          POST /_aliases
          {
              "actions": [
                  { "remove": { "index": "my_index_v1", "alias": "my_index" }},
                  { "add":    { "index": "my_index_v2", "alias": "my_index" }}
              ]
          }
        5. refresh_interval (bulk indexing)
          PUT /my_logs
          {
            "settings": {
              "refresh_interval": "30s" 
            }
          }
          POST /my_logs/_settings
          { "refresh_interval": -1 } 
          
          POST /my_logs/_settings
          { "refresh_interval": "1s" } 
        6. flush
          POST /blogs/_flush 
          
          POST /_flush?wait_for_ongoing
        7. optimize
          POST /logstash-old-index/_optimize?max_num_segments=1
        8. field length norm (for logging)
          PUT /my_index
          {
            "mappings": {
              "doc": {
                "properties": {
                  "text": {
                    "type": "string",
                    "norms": { "enabled": false } 
                  }
                }
              }
            }
          }
        9. tune cluster and index recovery settings (test the value)
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_initial_primary_recoveries":25}}'
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_concurrent_recoveries":5}}'
          ?
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.recovery.max_bytes_per_sec":"100mb"}}'
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.recovery.concurrent_streams":20}}'
      3. logging.yml
        1. use node.name instead of cluster.name
          file: ${path.logs}/${node.name}.log
      4. elasticsearch.in.sh
        1. disable HeapDumpOnOutOfMemoryError
          #JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
      5. ES_HEAP_SIZE: 50% (< 32g)
        1. export ES_HEAP_SIZE=31g
      6. no swap
        1. bootstrap.mlockall: true
        2. ulimit -l unlimited
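        3. to verify after restart (a sketch; assumes HTTP on localhost:9200)
          curl 'localhost:9200/_nodes/process?pretty' | grep mlockall    # should report true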
      7. thread pools
        1. thread pool size
          1. search - 3 * # of processors (3 * 64 = 192)
          2. index - 2 * # of processors (2 * 64 = 128)
          3. bulk - 3 * # of processors (3 * 64 = 192)
        2. queues - set the size to -1 to prevent rejections from ES
      8. buffers
        1. increased indexing buffer size to 40%
      9. dynamic node.name
        1. ES script
          export ES_NODENAME=`hostname -s`
        2. elasticsearch.yml
          node.name: "${ES_NODENAME}"
    3. Hardware
      1. CPU
        1. core
      2. disk
        1. SSD
          1. noop / deadline scheduler
          2. better IOPS
          3. cheaper with respect to IOPS
          4. manufacturing tolerance can vary
        2. RAID
          1. do not necessarily need
          2. ES handles redundancy
    4. Monitoring
      1. curl 'localhost:9200/_cluster/health'
      2. curl 'localhost:9200/_nodes/process'
        1. max_file_descriptors: 30000?
      3. curl 'localhost:9200/_nodes/jvm'
        1. version
        2. mem.heap_max
      4. curl 'localhost:9200/_nodes/jvm/stats'
        1. heap_used
      5. curl 'localhost:9200/_nodes/indices/stats'
        1. fielddata
      6. curl 'localhost:9200/_nodes/indices/stats?fields=created_on'
        1. fields
      7. curl 'localhost:9200/_nodes/http/stats'
        1. http
      8. GET /_stats/fielddata?fields=*
      9. GET /_nodes/stats/indices/fielddata?fields=*
      10. GET /_nodes/stats/indices/fielddata?level=indices&fields=*
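      11. to block until the cluster reaches a given state, e.g. before/after maintenance (a sketch; assumes localhost:9200)
        1. curl 'localhost:9200/_cluster/health?wait_for_status=green&timeout=120s'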
    5. Scenario
      1. adding nodes
        1. disable allocation to stop shard shuffling until ready
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
        2. increase speed of transfers
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
        3. start new nodes
        4. enable allocation
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
      2. removing nodes
        1. exclude the nodes from the cluster; this tells ES to move shards off them
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude._name":"node-05*,node-06*"}}'
        2. increase speed of transfers
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
        3. shutdown old nodes after all shards move off
          curl -XPOST 'localhost:9200/_cluster/nodes/node-05*,node-06*/_shutdown'
      3. upgrades / node restarts
        1. disable auto balancing  if doing rolling restarts
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
        2. restart
        3. re-enable auto balancing
          curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
      4. re / bulk indexing
        1. set replicas to 0
        2. increase after completion
      5. configure heap size
        1. heap size setting
        2. export ES_HEAP_SIZE=9g
        3. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
        4. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
        5. bin/elasticsearch -d
        6. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'

Zeppelin installation (HDP)

  1. echo $JAVA_HOME
  2. git clone https://github.com/apache/incubator-zeppelin.git
  3. cd incubator-zeppelin
  4. mvn clean install -DskipTests -Pspark-1.3 -Dspark.version=1.3.1 -Phadoop-2.6 -Pyarn
  5. hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
    1. 2.3.0.0-2557
  6. vim conf/zeppelin-env.sh
    1. export HADOOP_CONF_DIR=/etc/hadoop/conf 
    2. export ZEPPELIN_PORT=10008 
    3. export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.0.0-2557"
  7. cp /etc/hive/conf/hive-site.xml conf/
  8. su hdfs -l -c 'hdfs dfs -mkdir /user/zeppelin;hdfs dfs -chown zeppelin:hdfs /user/zeppelin'
  9. bin/zeppelin-daemon.sh start
  10. http://$host:10008

Maven

  1. 2015.08.12
    1. install maven using yum
      1. wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
      2. yum -y install apache-maven
    2. installation
      1. echo $JAVA_HOME
      2. cd /opt
      3. wget http://apache.tt.co.kr/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.zip
      4. wget http://www.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.zip.asc
      5. wget http://www.apache.org/dist/maven/KEYS
      6. gpg --import KEYS
      7. gpg --verify apache-maven-3.3.3-bin.zip.asc apache-maven-3.3.3-bin.zip
      8. unzip apache-maven-3.3.3-bin.zip
      9. export PATH=/opt/apache-maven-3.3.3/bin:$PATH
      10. mvn -v

Help 4 HDP - Old

  1. caused by: unrecognized locktype: native (solr)
    1. vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
    2. search lockType
    3. set it to hdfs
    4. /opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confname myCollConfigs -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf
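    5. after the edit, the element should look roughly like this (the stock file may wrap the value in a ${solr.lock.type:...} property instead)
      1. <lockType>hdfs</lockType>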
  2. caused by: direct buffer memory (solr)
    1. vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
    2. search solr.hdfs.blockcache.direct.memory.allocation
    3. set it to false
    4. /opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confname myCollConfigs -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf
    5. restart solr
    6. or try 'caused by: java heap space (solr)' directly
  3. caused by: java heap space (solr)
    1. vim /opt/lucidworks-hdpsearch/solr/bin/solr.in.sh
    2. search solr_heap
    3. increase it
    4. restart solr
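    5. for example, the solr.in.sh line would read (2g is only a placeholder; size it to the node)
      1. SOLR_HEAP="2g"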
  4. error: not found: value StructType  (spark)
    1. import org.apache.spark.sql.types._
    2. note that 'import org.apache.spark.sql.types._' is required even when 'import org.apache.spark.sql._' has already been imported
  5. no response from namenode UI / 50070 is bound to a private IP (hadoop)
    1. ambari web -> HDFS -> configs -> custom core-site -> add property
      1. key: dfs.namenode.http-bind-host
      2. value: 0.0.0.0
    2. save it and restart related services
    3. note that there are 'dfs.namenode.rpc-bind-host', 'dfs.namenode.servicerpc-bind-host' and 'dfs.namenode.https-bind-host' properties which can solve similar issues
  6. root is not allowed to impersonate <username> (hadoop)
    1. ambari web -> HDFS -> configs -> custom core-site -> add property
      1. key: hadoop.proxyuser.root.groups
      2. value: *
      3. key: hadoop.proxyuser.root.hosts
      4. value: *
    2. save it and restart related services
    3. note that you should change root to the user name who runs/submits the service/job
  7. option sql_select_limit=default (ambari)
    1. use latest jdbc driver
      1. cd /usr/share
      2. mkdir java
      3. cd java
      4. wget http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.36.zip
      5. unzip mysql-connector-java-5.1.36.zip
      6. cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar .
      7. ln -s mysql-connector-java-5.1.36-bin.jar mysql-connector-java.jar

HDP Search installation on HDP 2.3

  1. prerequisites
    1. CentOS v6.x / Red Hat Enterprise Linux (RHEL) v6.x / Oracle Linux v6.x
    2. JDK 1.7 or higher
    3. Hortonworks Data Platform (HDP) v2.3
  2. installation
    1. note that solr should be installed on each node that runs HDFS
    2. each node
      1. export JAVA_HOME=/usr/jdk64/jdk1.8.0_40/
      2. ls /etc/yum.repos.d/HDP-UTILS.repo
      3. yum -y install lucidworks-hdpsearch

Help 4 CDH

  1. failed to start name node (hadoop)
    1. check the permissions of the command(s) used by name node related processes, such as /bin/df
    2. modify permission(s) properly, if necessary
    3. delete the name node dir
    4. retry
    5. it is recommended to check and fix these command permissions before installing HDFS, otherwise HDFS may not be installed correctly
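    6. a quick check sketch for step 1 (permissions shown are examples)
      1. ls -l /bin/df    # should be world-executable, e.g. -rwxr-xr-x
        chmod 755 /bin/df    # adjust only if it is not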
  2. line length exceeds max (flume)
    1. increase deserializer.maxLineLength
      1. agent01.sources.source01.deserializer.maxLineLength
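      2. for example, in the agent's properties file (the value is only a placeholder)
        1. agent01.sources.source01.deserializer.maxLineLength = 65536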
  3. the channel is full (flume)
    1. increase memory capacity
      1. agent01.channels.channel01.capacity
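      2. for example (the value is only a placeholder)
        1. agent01.channels.channel01.capacity = 100000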
  4. fails to extract date information using the specified date format (flume)
    1. check LANG configuration
      1. LANG="en_US.UTF-8"
  5. failed parsing date from field (logstash)
    1. set locale to en
      1. date {
              locale => en
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
        }
  6. could not load ffi provider (logstash)
    1. configure logstash to use a temp directory other than /tmp, or
    2. mount -o remount,exec /tmp

Kerberos

  1. 2015.08.03
    1. Introduction
      1. Kerberos is a computer network authentication protocol which works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner
    2. Install a new MIT KDC
      1. yum -y install krb5-server krb5-libs krb5-workstation
      2. vi /etc/krb5.conf
      3. Change the [realms] section of this file by replacing the default “kerberos.example.com” setting for the kdc and admin_server properties with the Fully Qualified Domain Name of the KDC server host (see the example excerpt at the end of this list)
      4. kdb5_util create -s
      5. kadmin.local -q "addprinc admin/admin"
      6. /etc/rc.d/init.d/krb5kdc start 
      7. /etc/rc.d/init.d/kadmin start
      8. chkconfig krb5kdc on 
      9. chkconfig kadmin on
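      10. an example [realms] excerpt for step 3 (realm name and FQDN are placeholders)
        1. [realms]
            EXAMPLE.COM = {
             kdc = kdc01.example.com
             admin_server = kdc01.example.com
            }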

Lucidworks - Connectors

  1. 2015.08.04
    1. hive serde
      1. introduction
        1. The Lucidworks Hive SerDe allows reading and writing data to and from Solr using Apache Hive
      2. example
        1. hive
        2. CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
        3. LOAD DATA LOCAL INPATH '/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books;
        4. ADD JAR /opt/lucidworks-hdpsearch/hive/lucidworks-hive-serde-2.0.3.jar;
        5. CREATE EXTERNAL TABLE solr (id STRING, cat_s STRING, title_s STRING, price_f FLOAT, in_stock_b BOOLEAN, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.server.url' = 'http://10.0.2.104:8983/solr', 'solr.collection' = 'myCollection');
        6. INSERT OVERWRITE TABLE solr SELECT b.* FROM books b;
        7. solr UI -> core selector -> myCollection_shard1_replica1 -> query -> execute query