Mungeol Heo: 2018-01-21

Thursday, January 25, 2018

Help 4 Big Data

Spark

'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.

.config("spark.driver.bindAddress", "127.0.0.1")

error: missing or invalid dependency detected while loading class file 'BigQueryDataFrame.class' (spark)
- --packages com.spotify:spark-bigquery_2.11:0.2.2
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List (spark)
- ```
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>18.0</version>
</dependency>
```
Exception in thread "main" org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.plan.RelOptPlanner$CannotPlanException: Node [rel#7:Subset#1.BEAM_LOGICAL.[]] could not be implemented; planner state (beam)
- Use count(1) instead of count(*).
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object; (scio)
- Use the right scala version which scio supports.
java.lang.NoClassDefFoundError: com/google/protobuf/ProtocolStringList (spark)
- It is caused by using old version of protobuf such as probuf-java-2.5.0.jar which does not contain a ProtocolStringList class.
- Spark in HDP 2.6.0 uses probuf-java-2.5.0.jar.
- Use a newer version, like protobuf-java-2.6.1.jar will fix it.
- E.g.
  - Download the protobuf-java-2.6.1.jar.
    - https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/2.6.1/
  - spark-shell --conf spark.executor.userClassPathFirst=true --jars protobuf-java-2.6.1.jar
AbstractLifeCycle: FAILED ServerConnector@X{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: Address already in use (spark)
- Possible reasons
  - Someone is using spark
  - There is a service using 4040
  - There was a spark-shell process which did not exit properlyd
- netstat -lpn | grep 4040
  - tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 26464/java
- ps -ef | grep 26464
- kill -9 26464
java.io.FileNotFoundException: /data/hadoop/hdfs/data/current/VERSION (Permission denied) (hdfs)
- Check the permission
  - ls -al /data/hadoop/hdfs/data/current
- If it is not "hdfs:hadoop"
  - chown -R hdfs:hadoop /data/hadoop/hdfs/data/current
File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist (sqoop)
- mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
- chown -R hdfs:hadoop /hdp
- cd /hdp/apps/2.6.0.3-8/mapreduce
- hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer (spark)
- Add kafka client JAR file
- E.g. spark-shell --jars spark-sql-kafka-0-10_2.11-2.1.0.jar,kafka-clients-0.10.1.2.6.0.3-8.jar
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms (kafka)
- Check the broker's port number from ambari web UI and use the port number instead of 9092, and use the IP address of broker instead of localhost
  - bin/kafka-console-producer.sh --broker-list 10.100.99.152:6667 --topic rep-test
The status of clients are unknown (ambari)
- Try to restart or reinstall the clients
- If it gives a "Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?" error
  - ps -ef | grep apt
  - If there is apt-get or aptitude process, then kill it by using kill command
  - dpkg --configure -a
Amabri does not create/change hive-log4j.properties file in /etc/hive/2.6.0.3-8/0/conf.server or /usr/hdp/current/hive-server2/conf/conf.server (ambari)
- Add a new hive metastore service to the same node which has no hive-log4j.properties file in /etc/hive/2.6.0.3-8/0/conf.server folder
Hive servers in same cluster use different hive-log4j.properties files (hive)
- Option 1
  - Create a hive-log4j.properties file to the /etc/hive/2.6.0.3-8/0/conf.server folder if it is not exists
- Option 2
  - Delete the hive server
  - Create a new hive server in another master node and use it
- Option 3
  - Add a new hive metastore service to the same node which has no hive-log4j.properties file in /etc/hive/2.6.0.3-8/0/conf.server folder
Couldn't find leader offsets (spark)
- Choose right kafka connector
toDF not member of RDD (spark)
- val _spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
- val _sqlContext = _spark.sqlContext
- import _sqlContext.implicits._
Task Not Serializable (spark)
- val _spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
- val _sc = _spark.sparkContext
- val _sqlContext = _spark.sqlContext
- import _sqlContext.implicits._
- SparkSession.setActiveSession(_spark)
Distcp creates same folder again (distcp)
- E.g.
  - "hadoop distcp /a/b/target /c/d/target" gives /c/d/target.
  - If run the command again, it gives /c/d/target/target.
- Option
  - use -update option
  - E.g. hadoop distcp -update /a/b/target /c/d/target
TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask' (pyHive)
- Options
  - Specify the user name as hdfs in the Connection method.
- - Check tez.queue.name.
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate($int,WrappedArray()) (spark)
- ERROR Utils: Uncaught exception in thread pool-1-thread-1
  - java.lang.InterruptedException
- WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout,
  - java.util.concurrent.TimeoutException java.util.concurrent.TimeoutException
- Option
  - Set bigger value to "spark.ui.retainedTasks"
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh /usr/bin/hdp-select set all `ambari-python-wrap /usr/bin/hdp-select versions | grep ^2.6.0.3-8 | tail -1`' returned 1. ERROR: set command takes 2 parameters, instead of 1 (ambari)
- ambari-python-wrap /usr/bin/hdp-select versions
- ERROR: Unexpected file/directory found in /usr/hdp: test
- cd /usr/hdp
- rm -r test
Spark's log level become WARN, even though there is no related config is set (spark)
- It is because there is two --files option are specified while submitting the spark job.
  - E.g. --files a --files b
- Use one --files option will solve the problem.
  - E.g. --files a,b
Failed to add $JARName to Spark environment / java.io.FileNotFoundException: Jar $JARName not found / File file:/path/to/file does not exist (spark)
- Reason
  - This happens when the checkpoint is enabled and the spark job is submitted as client mode at first.
  - There will be these errors while trying to submit in cluster mode at next time.
- Option
  - Delete the checkpoint directory.
  - Submit the job in cluster mode.

WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect. (spark)

Using lazily instantiated singleton instance of SparkSession may avoid the warning.

use "val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)" instead of "val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()".

object SparkSessionSingleton {

  @transient  private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {

    if (instance == null) {

      instance = SparkSession

        .builder

        .config(sparkConf)

        .getOrCreate()

    }

    instance

  }

}

Exception in thread "main" java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72) (spark)
- remove "microsoft" from the URL.
  - E.g. use val url="jdbc:sqlserver://19.16.6.5:51051"
java.sql.SQLException: No suitable driver (spark)
- Specify the driver
  - prop.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
java.sql.SQLException: No suitable driver found for jdbc:... (spark)
- The JDBC driver class must be visible to the primordial class loader on the client session and on all executors.
- Include the dependency inside the JAR.
- Or, use --driver-class-path and --jar.
org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task (spark)
- Check spark version.
- Make spark dependencies as provided in pom.xml file.
- Remove spark dependencies conflict each other.
java.lang.NoClassDefFoundError: com/yammer/metrics/Metrics (spark)
- --jars /home/hdfs/lib/metrics-core-2.2.0.jar
- KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
error: not found: value kafka (spark)
- --jars /home/hdfs/lib/kafka_2.11-0.8.2.1.jar
- import kafka.serializer.StringDecoder
error: object kafka is not a member of package org.apache.spark.streaming (spark)
- --jars /home/hdfs/lib/spark-streaming-kafka-0-8_2.11-2.1.0.jar
- import org.apache.spark.streaming.kafka.KafkaUtils
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (spark)
- sc.stop
- val ssc = new StreamingContext(conf, Seconds(1))
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.IllegalArgumentException: Missing Hive MetaStore connection URI (sqoop)
- option 1
  - Do not use --as-parquetfile option.
- option 2
  - Change the hive.metastore.uris configuration to only have a single URI.
    - https://issues.cloudera.org/browse/KITE-762
org.kitesdk.data.ValidationException: Dataset name common.metagoods is not alphanumeric (plus '_') (sqoop)
- use --hive-database test --hive-table test
- instead of --hive-table test.test
60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel (HDFS)
- It may be because of the GC problem of the app running on the hadoop cluster.
- E.g. for spark app, remove unnecessary variables and tune the GC.
- Related errors
  - java.io.EOFException: Premature EOF: no length prefix available
  - DataXceiver error processing WRITE_BLOCK operation src: /IP:port dst: /IP:port java.net.SocketTimeoutException
  - java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/IP:port remote=/IP:port]. 60000 millis timeout left
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths (spark)
- If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
- E.g. spark.read.option("basePath", "/test/") .textFile( "/test/code=test1", "/test/code=test2")
java.lang.AssertionError: assertion failed: Conflicting partition column names detected (spark)
- Check the HDFS path.
  - /test/code=test1
  - /test/code=test2/code=test2
- Remove the incorrect partition column
  - E.g. hdfs dfs -rm -r /test/code=test2/code=test2
Very slow "select * from table limit 1" (hive)
- The query is very slow, even hangs or gives errors, if the table is an ORC table with large size files.
- Option 1
  - Save the table as text instead of ORC.
- Option 2
  - Recreate the table with smaller ORC files.
java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hive.ql.io.orc.CompressionKind.zlib (hive)
- Use 'stored as orc tblproperties ("orc.compress"="ZLIB")' instead of 'stored as orc tblproperties ("orc.compress"="zlib")'
- Same as snappy.
SemanticException org.apache.thrift.transport.TTransportException (hive)
- mysql -u hive -p
- use hive;
- select location from SDS;
- Check the service name of hadoop cluster
- If it is wrong, then update it.
The IP address of newly added host is marked as 127.0.1.1 from ambari (ambari)
- The reason is the wrong information in the /etc/hosts file. And the solutions for this problem depend on the use cases.
- Case 1
  - Change "10.10.10.10 test.com test" to "127.0.1.1 test.com test".
  - ambari-agent restart
  - Ambari UI → hosts show 127.0.1.1.
  - Change "127.0.1.1 test.com test" to "10.10.10.10 test.com test".
  - ambari-agent restart
  - Ambari UI → hosts show 10.10.10.10.
- Case 2
  - Comment "127.0.1.1 test".
    - #127.0.1.1 test
  - Change "10.10.10.10 test test.com" to "10.10.10.10 test"
  - ambari-agent restart
  - Ambari UI → hosts shows 10.10.10.10.
The status of newly added host is marked as unknown from ambari (ambari)
- The wrong information in the hosts and hoststate table cause the problem.
- The hosts and hoststate are the tables of RDBMS which ambari using.
- And the wrong information is there is an IP address has two hostname information.
  - E.g. There are "test, 10.10.10.10" and "test.com, 10.10.10.10".
- The reason why "test.com, 10.10.10.10" exists is this information is included in the /etc/hosts file while restarting ambari-agent.
- Solution
  - Remove "test.com, 10.10.10.10" from the two tables.
    - E.g. delete from hoststate where host_id=102; delete from hosts where host_id=102;
  - Remove "10.10.10.10 test.com" from the /etc/hosts file.
  - ambari-agent restart
Failed to connect to server: rm1.host.name/rm1.ip:8032: retries get failed due to exceeded maximum allowed retries number: 0 (YARN)
- The client will show the warning while it tries to connect the RM1 which is at standby status.
DataNode heap usage warning (ambari)
- https://community.hortonworks.com/articles/74076/datanode-high-heap-size-alert.html
- 4GB Heap recommendation
  - -Xms4096m -Xmx4096m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:ParallelGCThreads=8
If no severity is select in ambari alert notification. All severities will be selected (ambari)
- Clear all groups instead
Unable to edit ambari alert notification (ambari)
- Copy the notification, modify it, and save it as new notification. Then delete the old notification.
- Or, restart the ambari server.
Spark gives error if using spark.driver.userClassPathFirst=true and hive enabled in YARN cluster mode (spark)
- Use spark.executor.userClassPathFirst=true with hive enabled.
no viable alternative at input '<EOF>'(line 1, pos 4000) (spark)
- The string length of a column schema cannot exceed 4000.
- Divide the columns in a column into more columns.
- Or, explode the column.
Zeppelin in ambari UI shows stopped, if using command to restart the zepplein (ambari)
- E.g. /usr/hdp/current/zeppelin-server/bin/zeppelin-daemon.sh restart.
- Use the following command instead.
- su -l zeppelin -c "/usr/hdp/current/zeppelin-server/bin/zeppelin-daemon.sh restart"
Hosts show high load average (hadoop)
- Check the status of khugepaged.
- If its status is D, then check the status of the transparent huge page.
- if it is not never, then set THP to never permanently.
- Restart the hosts.
Ambari shows strange or invalid hostname (ambari)
- Delete strange or invalid hostname information from hosts and hoststate tables in the database which ambari using.
  - E.g.
    - delete from hosts where host_id > 12;
    - delete from hoststate where host_id > 12;
- Restart ambari server.
Spark driver shows high CPU usage (spark)
- Increase driver memory.
- Tune spark job.
  - E.g.
    - Unpersist finished variable.
    - Avoid unnecessary variable creation.
Two spark history servers are running with high CPU usage (spark)
- Stop one from the ambari UI first, and kill another.
- Start spark history server from the ambari UI.
Spark history server shows high CPU usage after restarting it (spark)
- Check /spark-history.
- If there are many files, then delete old or all files before restarting spark history server
Newly added column shows NULL in the partitioned hive table with existing data (hive)
- This issue will happen when trying to overwrite the existed partition, even if there is non-null value for the new column.
- And, will show the right value from newly added partitions.
- Options
  - Delete the partitions first, then insert the data
  - Or, add new column with cascade option
    - E.g. alter table table_name add columns (new_column string) cascade;
net.minidev.json.parser.ParseException: Unexpected duplicate key (spark)
- option 1
  - spark.driver.userClassPathFirst
    - (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only
  - spark.executor.userClassPathFirst
    - (Experimental) Same functionality as spark.driver.userClassPathFirst, but applied to executor instances
- option 2
  - cd /usr/hdp/current/spark2-client/jars
  - remove the old minidev library such as json-smart-1.1.1.jar
  - add new minidev libraries such as json-smart-2.3.jar and accessors-smart-1.2.jar
- option 3
  - java.sql.SQLException: No suitable driverSpark stddev functions returns different value from hive or similar function in SQL serverAdd following configuration to the maven-shade-plugin
  - <configuration>
    <relocations>
    <relocation>
    <pattern>net.minidev</pattern>
    <shadedPattern>shaded.net.minidev</shadedPattern>
    </relocation>
    </relocations>
    </configuration>
Invalid dfs.datanode.data.dir /data5/hadoop/hdfs/data : java.io.FileNotFoundException: File file:/data5/hadoop/hdfs/data does not exist (HDFS)
- cd /data5/h
- mkdir hdfs
- cd hdfs
- mkdir data
- chown hdfs:hadoop data/
- start data node
Duplicate key name 'CONSTRAINTS_PARENT_TABLE_ID_INDEX' (hive)
- Login to MySQL
- use hive;
- DROP INDEX CONSTRAINTS_PARENT_TABLE_ID_INDEX ON KEY_CONSTRAINTS;
- DROP TABLE WRITE_SET;
- ALTER TABLE TXN_COMPONENTS DROP COLUMN TC_OPERATION_TYPE;
- ALTER TABLE COMPACTION_QUEUE DROP COLUMN CQ_TBLPROPERTIES;
- ALTER TABLE COMPLETED_COMPACTIONS DROP COLUMN CC_TBLPROPERTIES;
- Reference
  - https://community.hortonworks.com/questions/64068/hive-upgrade-script-200-to-210mysql-error-duplicat.html
java.lang.OutOfMemoryError: Direct buffer memory (hbase)
- Using Ambari, modify your Hbase configuration and blank the following:
  - hbase.bucketcache.size
  - hbase.bucketcache.ioengine
  - hbase.bucketcache.percentage.in.combinedcache
- Modify hbase-env template: comment out line:
  - export HBASE_REGIONSERVER_OPTS = ... -XX:MaxDirectMemorySize Restart all affected
- Reference
  - https://community.hortonworks.com/questions/77119/hbase-master-crashing-after-startup.html
  - https://community.hortonworks.com/questions/9831/region-servers-crash.html
Failure in saving service configuration (ambari)
- Click the one of the old config and click 'Make V{number} Current'
- Restart the service
- Try to modify and save the configuration again
Add JAR file for solving "class not found" kinds of problem in the zeppelin notebook (zeppelin)
- Go to the interpreter configuration page
- Click the edit button on the spark interpreter section
- Add the path of the JAR file to the artifact column of the dependencies
  - E.g. /home/zeppelin/elasticsearch-hadoop-hive-5.0.0.jar
- Click save button
- Click restart button on the spark interpreter section
"class not found" errors while using zeppelin hive notebook (zeppelin)
- Add these jar files listed below to the "/interpreter/jdbc" folder. E.g. /usr/hdp/2.5.3.0-37/zeppelin/interpreter/jdbc
- curator-client-2.7.1.jar
- curator-framework-2.7.1.jar
- hadoop-common-2.7.3.2.5.3.0-37.jar
- hive-common-1.2.1000.2.5.3.0-37.jar
- hive-jdbc-1.2.1000.2.5.3.0-37.jar
- hive-metastore-1.2.1000.2.5.3.0-37.jar
- hive-service-1.2.1000.2.5.3.0-37.jar
- zookeeper-3.4.6.2.5.3.0-37.jar
tool.ImportTool: Encountered IOException running import job: java.io.IOException: Caught Exception checking database column EXCHANGE in hcatalog table (sqoop)
- do not use "EXCHANGE" as column name. rename it.
An error occurred while calling z:java.sql.DriverManager.getConnection. : java.sql.SQLException: java.lang.RuntimeException: java.lang.NullPointerException at (phoenix)
- add ":" symbol after the port number and before the zookeeper node
- e.g. jdbc:phoenix:zookeeper_host:2181:/hbase-unsecure
"'ascii' codec can't encode characters in position XXX: ordinal not in range(128)" (HUE)
- general notebook
  - https://issues.cloudera.org/browse/HUE-4815
  - https://github.com/cloudera/hue/commit/4bce9923bf7201fac3adf8ff89da0a448de70d19
- JDBC notebook
  - vim /usr/local/hue/desktop/libs/notebook/src/notebook/connectors/jdbc.py
  - add "from desktop.lib.i18n import smart_str"
  - modify "message = force_unicode(str(e))" to "message = force_unicode(smart_str(e))"
ambari web UI -> hosts shows inappropriate IP address (ambari)
- check the /etc/hosts file of the host shows inappropriate IP address
- modify the inappropriate contents in the /etc/hosts file
- ambari-agent restart
- (option) you may also need to update the hosts table in the ambari database of the ambari's RDBMS for modifying the inappropriate IP address, and restart the ambari server depends on your use case
Local OS is not compatible with cluster primary OS family. Please perform manual bootstrap on this host (ambari)
- use the commands listed below to confirm your OS version
  - uname -a: for all information regarding the kernel version
  - uname -r: for the exact kernel version
  - lsb_release -a: for all information related to the Ubuntu version
  - lsb_release -r: for the exact version
  - sudo fdisk -l: for partition information with all details
- if the OS versions are different, then reinstall your OS
UNIQUE constraint failed: auth_user.username / 1062, "Duplicate entry 'hue' for key 'username'" (HUE)
- Do not create an account called "hue" at very first time, since HUE will create an account called "hue" when installing examples
X is not allowed to impersonate Y (ambari)
- add the properties, placed below, to core-site
  - hadoop.proxyuser.X.groups = *
  - hadoop.proxyuser.X.hosts = *
Missing Required Header for CSRF protection (HUE)
- disable CSRF for livy via ambari
ConfigObjError: Parsing failed with several errors (HUE)
- invalid syntax exists in the hue.ini
Unauthorized connection for super-user: hcat from IP X.X.X.X (hive)
- add the properties, placed below, to core-site
  - hadoop.proxyuser.hcat.hosts=*
  - hadoop.proxyuser.hcat.groups=*
Queue's AM resource limit exceeded (YARN)
- increase yarn.scheduler.capacity.maximum-am-resource-percent
check and fix under replicated blocks in HDFS (HDFS)
- hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}'
- hadoop fs -setrep 3 file-name
java.lang.NoClassDefFoundError: org/apache/spark/deploy/SparkSubmit (oozie)
- add spark-assembly-*-hadoop*.jar to the lib directory of
- e.g. spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
Class org.apache.oozie.action.hadoop.SparkMain not found (oozie)
- add oozie-sharelib-spark-*.jar to the lib directory of the oozie app
- e.g. /user/hdfs/oozie-test/lib/oozie-sharelib-spark-4.2.0.2.5.0.0-1245.jar
unknown hosts (ambari)
- delete unknown hosts' data from hosts and hoststate tables of ambari databases
ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning (spark)
- use "--files /usr/hdp/current/spark-client/conf/hive-site.xml"
- instead of "--files /usr/hdp/current/hive-client/conf/hive-site.xml"
Exception in thread “dag-scheduler-event-loop” java.lang.OutOfMemoryError: Java heap space (spark)
- increase the driver's memory using "--driver-memory"
org.apache.spark.sql.AnalysisException: resolved attribute(s) ... HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() windowspecdefinition ... (spark)
- do not use an alias on a column which is already existied
  - e.g. PlatformADID as PlatformADID
- however, it is OK to use alias on a new column
  - e.g. min(test) as min_test
The information of vCores in the resource manager UI is different from the value of "--executor-cores" (spark)
- turn yarn.scheduler.capacity.resource-calculator on, if using HDP
- or, add the property placed below in the capacity-scheduler.xml file
  - <property> <name>yarn.scheduler.capacity.resource-calculator</name> <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value> </property>
Failed to find datanode, suggest to check cluster health. (HDFS)
- check both forward and reverse DNS setting
- or, add the information of datanodes to the /etc/hosts file
stddev_samp returns NaN (spark)
- option 1: cast(STDDEV_SAMP(column) as decimal(16, 10)) (spark SQL)
- option 2: STDDEV_SAMP = SQRT[N/(N-1)] * STEDDV_POP
spark job is getting slow, almost frozen, OOM-GC (spark)
- try to run a action at a intermediate stage of the job. It may help.
Hive_CLIENT in invalid state. Invalid transition. Invalid event: HOST_SVCCOMP_OP_IN_PROGRESS at INSTALL_FAILED (ambari)
- curl -H "X-Requested-By:ambari" -u admin:admin -X DELETE "http://localhost:8080/api/v1/clusters/hadooptest/hosts/crbsdatanode02.net/host_components/HIVE_CLIENT"
Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found (spark)
- --jars /local/path/to/datanucleus-api-jdo-3.2.6.jar (spark-submit --master yarn-cluster)
Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient (spark)
- --jars /local/path/to/datanucleus-core-3.2.10.jar (spark-submit --master yarn-cluster)
There is no available StoreManager of type "rdbms". Make sure that you have put the relevant DataNucleus store plugin in your CLASSPATH and if defining a connection via JNDI or DataSource you also need to provide persistence property "datanucleus.storeManagerType" (spark)
- --jars /home/hdfs/libs/datanucleus-rdbms-3.2.9.jar (spark-submit --master yarn-cluster)
Database does not exist: user_pattern (spark)
- --files /home/hdfs/libs/hive-site.xml (spark-submit --master yarn-cluster)
failed to start metrics monitor caused by psutil(ambari metrics)
- python /usr/lib/python2.6/site-packages/resource_monitoring/psutil/build.py
No FileSystem for scheme:hdfs (hbase)
- chmod 755 /bin/which
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block (HDFS)
- open ports
  - 50076 or 50075, 50010, 50020
Data is not balanced inside kafka partitions while using kafka sink (flume)
- Reason
  - Kafka Sink uses the topic and key properties from the FlumeEvent headers to send events to Kafka. Iftopicexists in the headers, the event will be sent to that specific topic, overriding the topic configured for the Sink. If key exists in the headers, the key will be used by Kafka to partition the data between the topic partitions. Events with the same key will be sent to the same partition. If the key is null, events will be sent to random partitions.
- Option
  - a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
    a1.sources.r1.interceptors.i1.headerName = key
Cannot produce messages to the specified topic while using both kafka source and sink (flume)
- agent01.sources.source01.interceptors = interceptor01
  agent01.sources.source01.interceptors.interceptor01.type = static
  agent01.sources.source01.interceptors.interceptor01.preserveExisting = false
  agent01.sources.source01.interceptors.interceptor01.key = topic
  agent01.sources.source01.interceptors.interceptor01.value = sink-topic
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block (HDFS)
'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState='08S01' (PyHive)
- use hdfs account, or accounts able to access hive tables to run the python code
mysql_config not found (HUE)
- apt-get install libmysqlclient-dev
Relative path in absolute URI: file:E:/eclipse-neon/workspace/Test/spark-warehouse (spark)
- SparkSession.buider().config("spark.sql.warehouse.dir", "E:/eclipse-neon/workspace/Test/spark-warehouse")
The root scratch dir: /tmp/hive on HDFS should be writable (hadoop)
- %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
- NOTE: use 'cd' command to change the driver where you run the spark app
java.io.IOException: Could not locate executable null\bin\winutils.exe (hadoop)
- Download the 64-bit winutils.exe (106KB)
  - Direct download link https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/winutils.exe
  - NOTE: there is a different winutils.exe file for the 32-bit Windows and it will not work on the 64-bit OS
- Copy the downloaded file winutils.exe into a folder like C:\hadoop\bin (or C:\spark\hadoop\bin)
- Set the environment variable HADOOP_HOME to point to the above directory but without \bin. For example:
  - if you copied the winutils.exe to C:\hadoop\bin, set HADOOP_HOME=C:\hadoop
  - if you copied the winutils.exe to C:\spark\hadoop\bin, set HADOOP_HOME=C:\spark\hadoop
- Double-check that the environment variable HADOOP_HOME is set properly by opening the Command Prompt and running echo %HADOOP_HOME%
- Restart the eclipse or intellij and try again

Help 4 Other

AttributeError: module 'boto' has no attribute 'plugin' (python)
- pip install google-compute-engine
- Option: restart your VM
OSError: [Errno 1] Operation not permitted (python, mac)
- sudo pip install --user packagename
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools)
- xcode-select --install
An unknown error occurred while loading this notebook. This version can load notebook formats or earlier. See the server log for details. (jupyter)
- Check disk usage
tf.data.Dataset.from_tensor_slices gives "IndexError: list index out of range" (tensorflow)
- the argument should be a list
- E.g. tf.data.Dataset.from_tensor_slices('data/adult.data.csv') → tf.data.Dataset.from_tensor_slices(['data/adult.data.csv'])
pip -V shows 3 while python -V shows 2 (python)
- Option
  - wget https://bootstrap.pypa.io/get-pip.py
  - sudo python get-pip.py
- On the contrary case
  - sudo python3 get-pip.py
DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size> 0` to check that an array is not empty. if diff: (sklearn)
- It is not supposed to do something like if array: to check if an array is empty. However, sklearn does this in the 0.19.1 release.
- Because your version of numpy is recent enough it complains and issues a warning.
- The problem has been fixed in sklearn's current master branch, so I'd expect the fix to be included in the next release.
- Chcek your sklearn version
  - pip show scikit-learn
- Ignoring the warning
  - import warnings
  - warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
Timezone setting (ubuntu)
- For one account
  - ```
  TZ='Asia/Seoul'; export TZ
```
- For system
  - sudo dpkg-reconfigure tzdata
no display name and no $DISPLAY environment variable (matplotlib)
- import matplotlib
  matplotlib.use('Agg')
savefig is cropping the image (matplotlib.pyplot)
- import matplotlib.pyplot as plt
- fig = plt.gcf()
- fig.savefig('test.png', bbox_inches='tight')
Saving plot to a file (matplotlib.pyplot)
- import matplotlib.pyplot as plt
- fig = plt.gcf()
- fig.savefig('test.png')
UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. (matplotlib.pyplot)
- import matplotlib.pyplot as plt
- fig.set_tight_layout(False)
Query text specifies use_legacy_sql:false, while API options specify:true (pandas_gbq)
- Specify dialect option as "standard".
Numerical columns become object type while reading CSV file, which is exported from bigQuery, using pandas.read_csv API (pandas)
- It is because of the CSV file exported from bigQuery.
- Exporting big data from bigQeury will generate more than one CSV files.
- And each CSV file has a header.
- While composing these CSV files to one CSV file, headers will be included.
ContextualVersionConflict: (requests 2.9.1 (/usr/lib/python3/dist-packages), Requirement.parse('requests>=2.18.0'), {'google-cloud-bigquery'}) (jupyter)
- Update the requests dependency.
- If the error still exists, then use "pip show" command to check the version of requests.
  - pip2 show requests
  - pip3 show requests
- Specify right dependencies for Jupyter while restarting it, which is recommended, or update the dependency for related python version.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" (general)
- This warning message is reported when the org.slf4j.impl.StaticLoggerBinder class could not be loaded into memory. This happens when no appropriate SLF4J binding could be found on the classpath. Placing one (and only one) of slf4j-nop.jar slf4j-simple.jar, slf4j-log4j12.jar, slf4j-jdk14.jar or logback-classic.jar on the classpath should solve the problem.
   <dependency> <groupId>org.slf4j</groupId> <artifactId>slf4j-simple</artifactId> <version>1.7.25</version> </dependency>
ValueError: Variable X already exists, disallowed (tensorflow)
- Use "tf.reset_default_graph()" before any code related to tensorflow
There is "ï»¿" or "\ufeff" in the output (python)
- Use the "utf-8-sig" encoding
  - E.g. with open('data/2.txt', encoding='utf-8-sig') as fp:
UnicodeDecodeError: 'cp949' codec can't decode byte X in position X: illegal multibyte sequence (python)
- Specify the encoding
- E.g. with open('data/2.txt', encoding='latin-1') as fp:
- Or,
- Save the text file with ANSI encoding
TypeError: parse() got an unexpected keyword argument 'transport_encoding' (gym)
- conda install --force html5lib
- pip install gym
error: object Y is not a member of package X (intellij)
- If the error is because of a local jar file, one of the options to solve it is installing the local jar file to the local maven repository.
- Install a local jar in the local maven repository
Unable to lock the administration directory (/var/lib/dpkg/), is another process using it? (linux)
- ps -ef | grep apt
- If there is apt-get or aptitude process, then kill it by usingkillcommand
- dpkg--configure -a
Script does not give expected result while using "sudosu- $user /path/to/script" (linux)
- use "sudosu$user -l -c'sh /path/to/script'" instead.
There are stopped jobs (linux)
- fg

Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes (maven)

Add the following configuration to the maven-shade-plugin.

<configuration>
    <filters>
        <filter>
            <artifact>*:*</artifact>
            <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
            </excludes>
        </filter>
    </filters>
    <!-- Additional configuration. -->
</configuration>

Some dependencies are not included in the JAR (maven)
- remove the following options from maven-shade-plugin
  - <minimizeJar>true</minimizeJar>
Unexpected token 'non-English character' (minidev)
- If the value consists of numbers followed by a non-English character, then it will give the unexpected token error.
  - E.g.
    - {creative_id:04 (23842564830590116),0아:1야}
    - Unexpected token 야 at position 39
- Put the value in brackets
  - E.g.
    - {creative_id:04 (23842564830590116),0아:(1야)}
    - {creative_id:04 (23842564830590116),0아:1(야)}
Strange host name while using openstack (openstack)
- E.g. VM shows the host name as "test.novalocal", even if the host name in /etc/hosts is "test.com" or "hostname test.com" command is used to specify the host name.
- Check the dhcp_domain at nova.conf.
- If it is "dhcp_domain=novalocal", then change it or report it to the related team, like infra.
java.lang.NoClassDefFoundError: net/minidev/asm/F (minidev)
- need "accessors-smart-1.2.jar"
cannot resolve symbol toDS/toDF (intellij)
- import spark.implicits._
Using platform encoding (UTF-8 actually) to copy filtered resources (intellij)
- Add the following content to the pom.xml file
- <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> </properties>
refusing to merge unrelated histories (git)
- Use '--allow-unrelated-histories'
- E.g.
  - Go to your project folder
  - git pull origin branch_name --allow-unrelated-histories
java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef (scala)
- Check the scala version
- E.g., app and compiler
You ($userName) are not allowed to access to (crontab) because of pam configuration. (crontab)
- It is because the password of the user is expired.
- Reset the password will solve the problem.
"'ascii' codec can't encode characters in position XXX: ordinal not in range(128)" (python)
- add "from desktop.lib.i18n import smart_str"
- use smart_str instead of str
no acceptable C compiler found in $PATH (linux)
- CentOS
  - yum groupinstall "Development tools"
- Ubuntu
  - apt-get install build-essential
gmp.h: No such file or directory (ubuntu)
- apt-get install libgmp3-dev
openssl/e_os2.h: No such file or directory (ubuntu)
- apt-get install libssl-dev
Unable to correct problems, you have held broken packages (apt-get)
- aptitude install <packagename>
[Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol (easy_install)
- pip install pyopenssl
[Errno 101] Network is unreachable (easy_install)
- export http_proxy=http://220.95.218.21:3128
- export https_proxy=http://220.95.218.21:3128
[Errno 101] Network is unreachable (pip)
- pip --upgrade pip
ffi.h: No such file or directory (linux)
- centOS: yum -y install libffi-devel
- ubuntu: apt-get install libffi-dev
openssl/opensslv.h: No such file or directory (centos)
- yum -y install openssl-devel
character set collation (SQL server)
- recreate related database or change the collation
  - USE master;
    GO
    ALTER DATABASE MyOptionsTest
    COLLATE French_CI_AS ;
    GO openssl
    
    --Verify the collation setting.
    SELECT name, collation_name
    FROM sys.databases
    WHERE name = N'MyOptionsTest';
    GO
- or, use collate like below
  - SELECT PID COLLATE Korean_100_CI_AS_WS aspid...
  - INNER JOIN #PAY_FROM_REGIST B
    ON A.Person_ID = B.Person_ID COLLATE Korean_Wansung_CI_AS
    AND A.Game_CD = B.Game_CD COLLATE Korean_Wansung_CI_AS
Hostname FQDN repeats the suffix twice (linux)
- e.g.
  - hostname datanade08.net
  - hostname
  - - datanade08.net
  - hostname -f
  - - datanade08.net.net
- check /etc/hostname and /etc/hosts files, and the spelling of the argument which is given to hostname command
  - e.g. the argument which is datanade08.net should be datanode08.net.
Failed to fetch (apt-get)
- cd /etc/apt
- check sources.list and files inside sources.list.d
date is not synchronized even if NTP is running (ntp)
- vim /etc/ntp.conf
- add
  - server 222.122.134.177
  - server 222.122.169.177
- service ntp restart
X must be owned by uid 0 and have the setuid bit set (chmod)
- chmod 4755 X
couldn't connect: Connection refused (apt-key)
- http://keyserver.ubuntu.com/
- search '0x<your key>'
- click url from result
- copy key block to a file. e.g. key.txt
- apt-key add key.txt
- apt-get update
dpkg: warning: files list file for package 'libdbd-mysql-perl' missing (apt-get)
- for package in $(apt-get upgrade 2>&1 | grep "warning: files list file for package '" | grep -Po "[^'\n ]+'" | grep -Po "[^']+"); do apt-get install --reinstall "$package"; done
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(<error number>, '<error message>'))' (pip)
- pip --proxy IP:port install <name>
Proxy URLs must have explicit schemes (pip)
- pip --proxy IP:port install <name>
gcc: error trying to exec 'cc1plus': execvp: No such file or directory (centos)
- yum -y install gcc-c++
fatal error: sasl/sasl.h: No such file or directory (centos)
- yum -y install openldap-devel
- apt-get -y install libsasl2-dev python-dev libldap2-dev libssl-dev
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found (centos)
- yum -y install cyrus-sasl-plain