Thursday, August 17, 2017

Reading zip files from Spark

  • com-cotdp-hadoop
    • Clone the GitHub repo: https://github.com/cotdp/com-cotdp-hadoop
    • Edit the pom.xml file
      <project xmlns="http://maven.apache.org/POM/4.0.0">
          <modelVersion>4.0.0</modelVersion>
          <groupId>com.cotdp.hadoop</groupId>
          <artifactId>com-cotdp-hadoop</artifactId>
          <packaging>jar</packaging>
          <version>1.0-SNAPSHOT</version>
          <name>com-cotdp-hadoop</name>
          <url>http://cotdp.com/</url>
          <properties>
              <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
          </properties>
          <repositories>
              <!-- HDP Repositories -->
              <repository>
                  <releases>
                      <enabled>true</enabled>
                      <updatePolicy>always</updatePolicy>
                      <checksumPolicy>warn</checksumPolicy>
                  </releases>
                  <snapshots>
                      <enabled>false</enabled>
                      <updatePolicy>never</updatePolicy>
                      <checksumPolicy>fail</checksumPolicy>
                  </snapshots>
                  <id>HDPReleases</id>
                  <name>HDP Releases</name>
                  <layout>default</layout>
              </repository>
              <repository>
                  <releases>
                      <enabled>true</enabled>
                      <updatePolicy>always</updatePolicy>
                      <checksumPolicy>warn</checksumPolicy>
                  </releases>
                  <snapshots>
                      <enabled>true</enabled>
                      <updatePolicy>never</updatePolicy>
                      <checksumPolicy>fail</checksumPolicy>
                  </snapshots>
                  <id>HDPPublic</id>
                  <name>HDP Public</name>
                  <layout>default</layout>
              </repository>
          </repositories>
          <dependencies>
              <dependency>
                  <groupId>org.apache.hadoop</groupId>
                  <artifactId>hadoop-common</artifactId>
                  <version>2.7.3.2.5.0.0-1245</version>
              </dependency>
              <dependency>
                  <groupId>org.apache.hadoop</groupId>
                  <artifactId>hadoop-client</artifactId>
                  <version>2.7.3.2.5.0.0-1245</version>
              </dependency>
              <dependency>
                  <groupId>org.codehaus.jackson</groupId>
                  <artifactId>jackson-mapper-asl</artifactId>
                  <version>1.9.8</version>
              </dependency>
              <dependency>
                  <groupId>junit</groupId>
                  <artifactId>junit</artifactId>
                  <version>3.8.1</version>
                  <scope>test</scope>
              </dependency>
          </dependencies>
      </project>
    • Use the command mvn package to build the jar
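    • For example, a typical build-and-launch sequence might look like this (local paths are assumptions; the jar name follows from the artifactId/version in the pom.xml above):

```shell
# Clone, build, and launch a Spark shell with the custom InputFormat on the classpath.
git clone https://github.com/cotdp/com-cotdp-hadoop
cd com-cotdp-hadoop
mvn package
# Make ZipFileInputFormat visible to the driver and executors:
spark-shell --jars target/com-cotdp-hadoop-1.0-SNAPSHOT.jar
```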
    • E.g.
      import com.cotdp.hadoop.ZipFileInputFormat
      import org.apache.hadoop.io.{BytesWritable, Text}
      val zipFileRDD = spark.sparkContext.newAPIHadoopFile(
        "hdfs://10.107.149.82:8020/tmp/spark/test.zip",
        classOf[ZipFileInputFormat],
        classOf[Text],
        classOf[BytesWritable],
        spark.sparkContext.hadoopConfiguration)
      println("The file name in the zip file is: " + zipFileRDD.map(s => s._1.toString).first())
      // Use getLength: getBytes may return a buffer longer than the valid data.
      println("The file contents are: " + zipFileRDD.map(s => new String(s._2.getBytes, 0, s._2.getLength, "UTF-8")).first())
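    • Under the hood the record reader essentially walks the archive with java.util.zip.ZipInputStream, emitting one (entry name, bytes) pair per file. A minimal, Spark-free sketch of that decoding loop (the in-memory archive and its contents are invented for illustration):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Build a small zip archive in memory so the sketch is self-contained.
val buf = new ByteArrayOutputStream()
val zos = new ZipOutputStream(buf)
zos.putNextEntry(new ZipEntry("hello.txt"))
zos.write("hello from inside the zip".getBytes("UTF-8"))
zos.closeEntry()
zos.close()

// Decode it the way a record reader would: one (name, bytes) pair per entry.
val zis = new ZipInputStream(new ByteArrayInputStream(buf.toByteArray))
val entries = Iterator.continually(zis.getNextEntry)
  .takeWhile(_ != null)
  .map { entry =>
    // Read the current entry's uncompressed data fully before moving on.
    val out = new ByteArrayOutputStream()
    val chunk = new Array[Byte](4096)
    var n = zis.read(chunk)
    while (n != -1) { out.write(chunk, 0, n); n = zis.read(chunk) }
    (entry.getName, out.toByteArray)
  }.toList

entries.foreach { case (name, bytes) =>
  println(s"$name -> ${new String(bytes, "UTF-8")}")
}
```

      Each element of `entries` corresponds to one (Text, BytesWritable) record in the Spark example above.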
    • Limitation
      • It cannot process a zip archive in which an entry's uncompressed size exceeds 2 GB, since each entry is materialized as a single BytesWritable backed by a Java byte array (limited to Integer.MAX_VALUE bytes)
    • Reference
  • solr-hadoop-common
  • zipstream