- bin/flume-ng: line X: syntax error in conditional expression (flume)
- add " symbol. for instance,
- if [[ $line =~ "^java\.library\.path=(.*)$" ]]; then
- or, update bash.
- check the bash version with 'bash --version'
- add " symbol. for instance,
- warning: unprotected private key file! / permissions 0xxx for 'path/to/id_rsa' are too open (OpenSSH)
- chmod 600 path/to/id_rsa
- ask for password, even authorized_keys file exists (SSH)
- chmod 700 .ssh
- chmod 644 .ssh/authorized_keys
- failed to recv file (R - scp)
- check the file path
- no lines available in input / '입력에 가능한 라인들이 없습니다' in a Korean locale (R - read.table with pipe)
- check the file path
- java.net.MalformedURLException: unknown protocol: hdfs (java)
- add 'URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());'
- argument list too long (curl)
- the command below causes the error, because 'cat base64.txt' expands to very long content
- curl -XPOST "http://localhost:9200/test/person" -d '
{
"file" : {
"_content" : "'`cat base64.txt`'"
}
}'
- use '@-' to solve the problem
- curl -XPOST "http://localhost:9200/test/person" -d @- <<CURL_DATA
{
"file" : {
"_content" : "`cat base64.txt`"
}
}
CURL_DATA
- note that you can use any string instead of CURL_DATA as the heredoc delimiter, and that there are no single quotes inside the value of _content this time
- invalid target release: 1.7 (maven)
- export JAVA_HOME=<java home path>
Monday, August 31, 2015
Help 4 Error - Old
Behemoth
- 2015.08.18
- prerequisites
- java 1.6
- apache maven 2.2.1
- internet connection
- compiling
- git clone https://github.com/DigitalPebble/behemoth.git
- cd behemoth
- mvn install
- mvn test
- mvn package
- generate a corpus
- hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i <file or dir> -o output1
- ./behemoth importer
- extract text
- hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i output1 -o output2
- ./behemoth tika
- inspect the corpus
- hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusReader -i output2 -a -c -m -t
- hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00000
- hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00001
- ./behemoth reader
- extract content from seq files
- hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2 -o output3
- hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00000 -o output4
- hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00001 -o output5
- ./behemoth exporter
ElasticSearch
- 2015.08.14
- mapper attachments type for elasticsearch
- each node
- bin/plugin install elasticsearch/elasticsearch-mapper-attachments/2.4.3
- note that 2.4.3 is for ES 1.4
- restart
- DELETE /test
- PUT /test
- PUT /test/person/_mapping
{
"person" : {
"properties" : {
"file" : {
"type" : "attachment",
"fields" : {
"file" : {"term_vector" : "with_positions_offsets", "store": true},
"title" : {"store" : "yes"},
"date" : {"store" : "yes"},
"author" : {"store" : "yes"},
"keywords" : {"store" : "yes"},
"content_type" : {"store" : "yes"},
"content_length" : {"store" : "yes"},
"language" : {"store" : "yes"}
}
}
}
}
}
- curl -XPOST "http://localhost:9200/test/person" -d '
{
"file" : {
"_content" : "... base64 encoded attachment ..."
}
}'
- for long base64
- curl -XPOST "http://localhost:9200/test/person" -d @- <<CURL_DATA
{
"file" : {
"_content" : "`base64 my.pdf | perl -pe 's/\n/\\n/g'`"
}
}
CURL_DATA
- GET /test/person/_search
{
"fields": [ "file.date", "file.title", "file.name", "file.author", "file.keywords", "file.language", "file.cotent_length", "file.content_type", "file" ],
"query": {
"match": {
"file.content_type": "pdf"
}
}
}
- 2015.03.03
- bashrc
- export INNERIP=`hostname -i`
export ES_HEAP_SIZE=8g
export ES_CLASSPATH=/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*
- configuration
- cluster.name: test
- node.name: ${HOSTNAME}
- transport.host: ${INNERIP}
- discovery.zen.ping.multicast.enabled: false
- discovery.zen.ping.unicast.hosts: ["10.0.2.a", "10.0.2.b", "10.0.2.c"]
- indices.fielddata.cache.size: 40%
- 2015.03.02
- snapshot and restore
- repository register
- PUT _snapshot/hdfs
{
"type": "hdfs",
"settings": {
"path": "/backup/elasticsearch"
}
}
- repository verification
- POST _snapshot/hdfs/_verify
- snapshot
- PUT _snapshot/hdfs/20150302
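- the same call as a curl one-liner (a sketch assuming a node on localhost:9200; wait_for_completion=true makes the request block until the snapshot finishes)
- curl -XPUT 'localhost:9200/_snapshot/hdfs/20150302?wait_for_completion=true'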
- monitoring snapshot/restore progress
- GET _snapshot/hdfs/20150302/_status
- GET _snapshot/hdfs/20150302
- snapshot information and status
- GET _snapshot/hdfs/20150302
- GET _snapshot/hdfs/_all
- GET _snapshot/_status
- GET _snapshot/hdfs/_status
- GET _snapshot/hdfs/20150302/_status
- restore
- POST _snapshot/hdfs/20150302/_restore
- snapshot deletion / stopping currently running snapshot and restore operations
- DELETE _snapshot/hdfs/20150302
- repository deletion
- DELETE _snapshot/hdfs
- reference
- rolling update
- Disable shard reallocation
- curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
- Shut down a single node within the cluster
- curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
- Confirm that all shards are correctly reallocated to the remaining running nodes
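- one way to check this (a sketch, not part of the original notes) is the cat shards API
- curl 'localhost:9200/_cat/shards'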
- Download newest version
- Extract the zip or tarball to a new directory
- Copy the configuration files from the old Elasticsearch installation’s config directory to the new Elasticsearch installation’s config directory
- Move data files from the old Elasticsearch installation’s data directory to the new installation’s data directory
- Install plugins
- Start the now upgraded node
- Confirm that it joins the cluster
- Re-enable shard reallocation
- curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'
- Observe that all shards are properly allocated on all nodes
- Repeat this process for all remaining nodes
- Reference
- 2015.02.13
- MySQL Slow Query Log Mapping
PUT msql-2015
{
  "mappings": {
    "log": {
      "properties": {
        "@timestamp": { "type": "date", "format": "dateOptionalTime" },
        "@version": { "type": "string" },
        "host": { "type": "string", "index": "not_analyzed" },
        "ip": { "type": "string", "index": "not_analyzed" },
        "lock_time": { "type": "double" },
        "message": { "type": "string", "index": "not_analyzed" },
        "query": { "type": "string" },
        "query_time": { "type": "double" },
        "rows_examined": { "type": "double" },
        "rows_sent": { "type": "double" },
        "type": { "type": "string" },
        "user": { "type": "string" }
      }
    }
  }
}
- MySQL Slow Query Dump Mapping
PUT msqld-2015
{
  "mappings": {
    "dump": {
      "properties": {
        "@timestamp": { "type": "date", "format": "dateOptionalTime" },
        "@version": { "type": "string" },
        "count": { "type": "double" },
        "host": { "type": "string", "index": "not_analyzed" },
        "ip": { "type": "string", "index": "not_analyzed" },
        "lock": { "type": "double" },
        "message": { "type": "string", "index": "not_analyzed" },
        "query": { "type": "string" },
        "rows": { "type": "double" },
        "time": { "type": "double" },
        "type": { "type": "string" },
        "user": { "type": "string" }
      }
    }
  }
}
- 2015.02.12
- MySQL Slow Query Log & Dump Mappings
PUT msqld-2015
{
  "mappings": {
    "log": {
      "properties": {
        "@timestamp": { "type": "date", "format": "dateOptionalTime" },
        "@version": { "type": "string" },
        "host": { "type": "string", "index": "not_analyzed" },
        "ip": { "type": "string", "index": "not_analyzed" },
        "lock_time": { "type": "double" },
        "message": { "type": "string", "index": "not_analyzed" },
        "query": { "type": "string" },
        "query_time": { "type": "double" },
        "rows_examined": { "type": "double" },
        "rows_sent": { "type": "double" },
        "type": { "type": "string" },
        "user": { "type": "string" }
      }
    },
    "dump": {
      "properties": {
        "@timestamp": { "type": "date", "format": "dateOptionalTime" },
        "@version": { "type": "string" },
        "count": { "type": "double" },
        "host": { "type": "string", "index": "not_analyzed" },
        "ip": { "type": "string", "index": "not_analyzed" },
        "lock": { "type": "double" },
        "message": { "type": "string", "index": "not_analyzed" },
        "query": { "type": "string" },
        "rows": { "type": "double" },
        "time": { "type": "double" },
        "type": { "type": "string" },
        "user": { "type": "string" }
      }
    }
  }
}
- 2015.01.19
- restart script
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
sleep 1s
curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
sleep 1s
bin/elasticsearch -d
sleep 10s
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
- ~ 2015.01.01
- Command
- curl 'http://localhost:9200/?pretty'
- curl -XPOST 'http://localhost:9200/_shutdown'
- curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
- curl -XPOST 'http://localhost:9200/_cluster/nodes/nodeId1,nodeId2/_shutdown'
- curl -XPOST 'http://localhost:9200/_cluster/nodes/_master/_shutdown'
- Configuration
- config/elasticsearch.yml
- cluster.name
- node.name
- node.master
- node.data
- path.*
- path.conf: -Des.path.conf
- path.data
- path.work
- path.logs
- discovery.zen.ping.multicast.enabled: false
- discovery.zen.ping.unicast.hosts
- gateway.recover_after_nodes: n
- discovery.zen.minimum_master_nodes: (n/2) + 1
- action.disable_delete_all_indices: true
- action.auto_create_index: false
- action.destructive_requires_name: true
- index.mapper.dynamic: false
- script.disable_dynamic: true
- indices.fielddata.cache.size: 40%
- dynamic
- discovery.zen.minimum_master_nodes
curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent" : { "discovery.zen.minimum_master_nodes" : (n/2) + 1 } }'
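- for example, a 3-node cluster would use (3/2) + 1 = 2
curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent" : { "discovery.zen.minimum_master_nodes" : 2 } }'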
- disable _all
PUT /my_index/_mapping/my_type { "my_type": { "_all": { "enabled": false } } }
- include_in_all
PUT /my_index/my_type/_mapping { "my_type": { "include_in_all": false, "properties": { "title": { "type": "string", "include_in_all": true }, ... } } }
- _alias, _aliases
PUT /my_index_v1
PUT /my_index_v1/_alias/my_index
POST /_aliases { "actions": [ { "remove": { "index": "my_index_v1", "alias": "my_index" }}, { "add": { "index": "my_index_v2", "alias": "my_index" }} ] }
- refresh_interval (bulk indexing)
PUT /my_logs { "settings": { "refresh_interval": "30s" } }
POST /my_logs/_settings { "refresh_interval": -1 }
POST /my_logs/_settings { "refresh_interval": "1s" }
- flush
POST /blogs/_flush
POST /_flush?wait_for_ongoing
- optimize
POST /logstash-old-index/_optimize?max_num_segments=1
- field length norm (for logging)
PUT /my_index { "mappings": { "doc": { "properties": { "text": { "type": "string", "norms": { "enabled": false } } } } } }
- tune cluster and index recovery settings (test the value)
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_initial_primary_recoveries":25}}'
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_concurrent_recoveries":5}}' ?
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.max_bytes_per_sec":"100mb"}}'
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":20}}'
- logging.yml
- use node.name instead of cluster.name
file: ${path.logs}/${node.name}.log
- elasticsearch.in.sh
- disable HeapDumpOnOutOfMemoryError
#JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
- ES_HEAP_SIZE: 50% (< 32g)
- export ES_HEAP_SIZE=31g
- no swap
- bootstrap.mlockall = true
- ulimit -l unlimited
- thread pools
- thread pool size
- search - 3 * # of processors (3 * 64 = 192)
- index - 2 * # of processors (2 * 64 = 128)
- bulk - 3 * # of processors (3 * 64 = 192)
- queues - set the size to -1 to prevent rejections from ES
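- a sketch of applying these sizes dynamically on a 64-processor node (assumes ES 1.x, where the threadpool.* sizes are updatable through the cluster settings API)
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"threadpool.search.size":192,"threadpool.search.queue_size":-1,"threadpool.index.size":128,"threadpool.bulk.size":192,"threadpool.bulk.queue_size":-1}}'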
- buffers
- increased indexing buffer size to 40%
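- a sketch of the corresponding elasticsearch.yml line, using the 40% figure above
echo 'indices.memory.index_buffer_size: 40%' >> config/elasticsearch.yml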
- dynamic node.name
- ES script
export ES_NODENAME=`hostname -s`
- elasticsearch.yml
node.name: "${ES_NODENAME}"
- Hardware
- CPU
- core
- disk
- SSD
- noop / deadline scheduler
- better IOPS
- cheaper with respect to IOPS
- manufacturing tolerance can vary
- RAID
- do not necessarily need
- ES handles redundancy
- Monitoring
- curl 'localhost:9200/_cluster/health'
- curl 'localhost:9200/_nodes/process'
- max_file_descriptors: 30000?
- curl 'localhost:9200/_nodes/jvm'
- version
- mem.heap_max
- curl 'localhost:9200/_nodes/jvm/stats'
- heap_used
- curl 'localhost:9200/_nodes/indices/stats'
- fielddata
- curl 'localhost:9200/_nodes/indices/stats?fields=created_on'
- fields
- curl 'localhost:9200/_nodes/http/stats'
- http
- GET /_stats/fielddata?fields=*
- GET /_nodes/stats/indices/fielddata?fields=*
- GET /_nodes/stats/indices/fielddata?level=indices&fields=*
- Scenario
- adding nodes
- disable allocation to stop shard shuffling until ready
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
- increase speed of transfers
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
- start new nodes
- enable allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
- removing nodes
- exclude the nodes from the cluster, this will tell ES to move things off
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude._name":"node-05*,node-06*"}}'
- increase speed of transfers
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
- shutdown old nodes after all shards move off
curl -XPOST 'localhost:9200/_cluster/nodes/node-05*,node-06*/_shutdown'
- upgrades / node restarts
- disable auto balancing if doing rolling restarts
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
- restart
- enable auto balancing
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
- re-indexing / bulk indexing
- set replicas to 0
- increase after completion
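- a sketch of the two settings calls (my_index is a placeholder; number_of_replicas is a dynamic index setting)
curl -XPUT localhost:9200/my_index/_settings -d '{"index":{"number_of_replicas":0}}'
curl -XPUT localhost:9200/my_index/_settings -d '{"index":{"number_of_replicas":1}}'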
- configure heap size
- heap size setting
- export ES_HEAP_SIZE=9g
- curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
- curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
- bin/elasticsearch -d
- curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
Zeppelin installation (HDP)
- echo $JAVA_HOME
- git clone https://github.com/apache/incubator-zeppelin.git
- cd incubator-zeppelin
- mvn clean install -DskipTests -Pspark-1.3 -Dspark.version=1.3.1 -Phadoop-2.6 -Pyarn
- hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
- 2.3.0.0-2557
- vim conf/zeppelin-env.sh
- export HADOOP_CONF_DIR=/etc/hadoop/conf
- export ZEPPELIN_PORT=10008
- export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.0.0-2557"
- cp /etc/hive/conf/hive-site.xml conf/
- su hdfs -l -c 'hdfs dfs -mkdir /user/zeppelin;hdfs dfs -chown zeppelin:hdfs /user/zeppelin'
- bin/zeppelin-daemon.sh start
- http://$host:10008
Maven
- 2015.08.12
- install maven using yum
- wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
- yum -y install apache-maven
- installation
- echo $JAVA_HOME
- cd /opt
- wget http://apache.tt.co.kr/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.zip
- wget http://www.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.zip.asc
- wget http://www.apache.org/dist/maven/KEYS
- gpg --import KEYS
- gpg --verify apache-maven-3.3.3-bin.zip.asc apache-maven-3.3.3-bin.zip
- unzip apache-maven-3.3.3-bin.zip
- export PATH=/opt/apache-maven-3.3.3/bin:$PATH
- mvn -v
Help 4 HDP - Old
- caused by: unrecognized locktype: native (solr)
- vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
- search lockType
- set it to hdfs
- /opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confname myCollConfigs -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf
- caused by: direct buffer memory (solr)
- vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
- search solr.hdfs.blockcache.direct.memory.allocation
- set it to false
- /opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confname myCollConfigs -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf
- restart solr
- or try 'caused by: java heap space (solr)' directly
- caused by: java heap space (solr)
- vim /opt/lucidworks-hdpsearch/solr/bin/solr.in.sh
- search solr_heap
- increase it
- restart solr
- error: not found: value StructType (spark)
- import org.apache.spark.sql.types._
- note that you need to import 'import org.apache.spark.sql.types._' even if 'import org.apache.spark.sql._' is already imported
- no response from namenode UI / 50070 is bound to a private IP (hadoop)
- ambari web -> HDFS -> configs -> custom core-site -> add property
- key: dfs.namenode.http-bind-host
- value: 0.0.0.0
- save it and restart related services
- note that there are 'dfs.namenode.rpc-bind-host', 'dfs.namenode.servicerpc-bind-host' and 'dfs.namenode.https-bind-host' properties which can solve similar issue
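- a quick check after the restart (not from the original notes): confirm the UI port is now bound to 0.0.0.0
- netstat -tlnp | grep 50070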
- root is not allowed to impersonate <username> (hadoop)
- ambari web -> HDFS -> configs -> custom core-site -> add property
- key: hadoop.proxyuser.root.groups
- value: *
- key: hadoop.proxyuser.root.hosts
- value: *
- save it and restart related services
- note that you should change root to the user name who runs/submits the service/job
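- a quick way to confirm the effective value after the restart (a sketch using hdfs getconf)
- hdfs getconf -confKey hadoop.proxyuser.root.hosts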
- option sql_select_limit=default (ambari)
- use latest jdbc driver
- cd /usr/share
- mkdir java
- cd java
- wget http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.36.zip
- unzip mysql-connector-java-5.1.36.zip
- cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar .
- ln -s mysql-connector-java-5.1.36-bin.jar mysql-connector-java.jar
HDP Search installation on HDP 2.3
- prerequisites
- CentOS v6.x / Red Hat Enterprise Linux (RHEL) v6.x / Oracle Linux v6.x
- JDK 1.7 or higher
- Hortonworks Data Platform (HDP) v2.3
- installation
- note that solr should be installed on each node that runs HDFS
- each node
- export JAVA_HOME=/usr/jdk64/jdk1.8.0_40/
- ls /etc/yum.repos.d/HDP-UTILS.repo
- yum -y install lucidworks-hdpsearch
Help 4 CDH
- failed to start name node (hadoop)
- check the permissions of the commands used by name node processes, such as /bin/df (see the example below)
- modify permission(s) properly, if necessary
- delete the name node dir
- retry
- it is recommended to check and fix these command permissions before installing HDFS; otherwise HDFS may not be installed correctly
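- for example (a sketch; adjust to whatever command the name node log complains about)
- ls -l /bin/df
- chmod 755 /bin/df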
- line length exceeds max (flume)
- increase deserializer.maxlinelength
- agent01.sources.source01.deserializer.maxLineLength
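- for instance (204800 is just an illustrative value)
- agent01.sources.source01.deserializer.maxLineLength = 204800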
- the channel is full (flume)
- increase memory capacity
- agent01.channels.channel01.capacity
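- for instance (100000 is just an illustrative value)
- agent01.channels.channel01.capacity = 100000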
- fail to extract date information by using specified date format (flume)
- check LANG configuration
- LANG="en_US.UTF-8"
- failed parsing date from field (logstash)
- set locale to en
date { locale => en match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ] }
- could not load ffi provider (logstash)
- configure it to use a directory other than /tmp, or
- mount -o remount,exec /tmp
Kerberos
- 2015.08.03
- Introduction
- Kerberos is a computer network authentication protocol which works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner
- Install a new MIT KDC
- yum -y install krb5-server krb5-libs krb5-workstation
- vi /etc/krb5.conf
- Change the [realms] section of this file by replacing the default “kerberos.example.com” setting for the kdc and admin_server properties with the Fully Qualified Domain Name of the KDC server host
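- a sketch of the resulting [realms] section (kdc.example.com stands in for the real KDC FQDN)
[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com
    admin_server = kdc.example.com
  }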
- kdb5_util create -s
- kadmin.local -q "addprinc admin/admin"
- /etc/rc.d/init.d/krb5kdc start
- /etc/rc.d/init.d/kadmin start
- chkconfig krb5kdc on
- chkconfig kadmin on
Lucidworks - Connectors
- 2015.08.04
- hive serde
- introduction
- The Lucidworks Hive SerDe allows reading and writing data to and from Solr using Apache Hive
- example
- hive
- CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- LOAD DATA LOCAL INPATH '/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books;
- ADD JAR /opt/lucidworks-hdpsearch/hive/lucidworks-hive-serde-2.0.3.jar;
- CREATE EXTERNAL TABLE solr (id STRING, cat_s STRING, title_s STRING, price_f FLOAT, in_stock_b BOOLEAN, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.server.url' = 'http://10.0.2.104:8983/solr', 'solr.collection' = 'myCollection');
- INSERT OVERWRITE TABLE solr SELECT b.* FROM books b;
- solr UI -> core selector -> myCollection_shard1_replica1 -> query -> execute query