Thursday, August 17, 2017

Reading zip files from Spark

  • com-cotdp-hadoop
    • Clone the GitHub repo: https://github.com/cotdp/com-cotdp-hadoop
    • Edit the pom.xml file
      <project xmlns="http://maven.apache.org/POM/4.0.0">
          <modelVersion>4.0.0</modelVersion>
          <groupId>com.cotdp.hadoop</groupId>
          <artifactId>com-cotdp-hadoop</artifactId>
          <packaging>jar</packaging>
          <version>1.0-SNAPSHOT</version>
          <name>com-cotdp-hadoop</name>
          <url>http://cotdp.com/</url>
          <properties>
              <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
          </properties>
          <repositories>
              <!-- HDP Repositories -->
              <repository>
                  <releases>
                      <enabled>true</enabled>
                      <updatePolicy>always</updatePolicy>
                      <checksumPolicy>warn</checksumPolicy>
                  </releases>
                  <snapshots>
                      <enabled>false</enabled>
                      <updatePolicy>never</updatePolicy>
                      <checksumPolicy>fail</checksumPolicy>
                  </snapshots>
                  <id>HDPReleases</id>
                  <name>HDP Releases</name>
                  <layout>default</layout>
              </repository>
              <repository>
                  <releases>
                      <enabled>true</enabled>
                      <updatePolicy>always</updatePolicy>
                      <checksumPolicy>warn</checksumPolicy>
                  </releases>
                  <snapshots>
                      <enabled>true</enabled>
                      <updatePolicy>never</updatePolicy>
                      <checksumPolicy>fail</checksumPolicy>
                  </snapshots>
                  <id>HDPPublic</id>
                  <name>HDP Public</name>
                  <layout>default</layout>
              </repository>
          </repositories>
          <dependencies>
              <dependency>
                  <groupId>org.apache.hadoop</groupId>
                  <artifactId>hadoop-common</artifactId>
                  <version>2.7.3.2.5.0.0-1245</version>
              </dependency>
              <dependency>
                  <groupId>org.apache.hadoop</groupId>
                  <artifactId>hadoop-client</artifactId>
                  <version>2.7.3.2.5.0.0-1245</version>
              </dependency>
              <dependency>
                  <groupId>org.codehaus.jackson</groupId>
                  <artifactId>jackson-mapper-asl</artifactId>
                  <version>1.9.8</version>
              </dependency>
              <dependency>
                  <groupId>junit</groupId>
                  <artifactId>junit</artifactId>
                  <version>3.8.1</version>
                  <scope>test</scope>
              </dependency>
          </dependencies>
      </project>
    • Use the command mvn package to build the jar
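    • For example, a typical build-and-launch sequence might look like this (local paths are assumptions; the jar name follows from the artifactId/version in the pom.xml above):

```shell
# Clone, build, and launch a Spark shell with the custom InputFormat on the classpath.
git clone https://github.com/cotdp/com-cotdp-hadoop
cd com-cotdp-hadoop
mvn package
# Make ZipFileInputFormat visible to the driver and executors:
spark-shell --jars target/com-cotdp-hadoop-1.0-SNAPSHOT.jar
```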
    • E.g.
      import com.cotdp.hadoop.ZipFileInputFormat
      import org.apache.hadoop.io.{BytesWritable, Text}
      val zipFileRDD = spark.sparkContext.newAPIHadoopFile(
        "hdfs://10.107.149.82:8020/tmp/spark/test.zip",
        classOf[ZipFileInputFormat],
        classOf[Text],
        classOf[BytesWritable],
        spark.sparkContext.hadoopConfiguration)
      println("The file name in the zip file is: " + zipFileRDD.map(s => s._1.toString).first())
      // Use getLength: getBytes may return a buffer longer than the valid data.
      println("The file contents are: " + zipFileRDD.map(s => new String(s._2.getBytes, 0, s._2.getLength, "UTF-8")).first())
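    • Under the hood the record reader essentially walks the archive with java.util.zip.ZipInputStream, emitting one (entry name, bytes) pair per file. A minimal, Spark-free sketch of that decoding loop (the in-memory archive and its contents are invented for illustration):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Build a small zip archive in memory so the sketch is self-contained.
val buf = new ByteArrayOutputStream()
val zos = new ZipOutputStream(buf)
zos.putNextEntry(new ZipEntry("hello.txt"))
zos.write("hello from inside the zip".getBytes("UTF-8"))
zos.closeEntry()
zos.close()

// Decode it the way a record reader would: one (name, bytes) pair per entry.
val zis = new ZipInputStream(new ByteArrayInputStream(buf.toByteArray))
val entries = Iterator.continually(zis.getNextEntry)
  .takeWhile(_ != null)
  .map { entry =>
    // Read the current entry's uncompressed data fully before moving on.
    val out = new ByteArrayOutputStream()
    val chunk = new Array[Byte](4096)
    var n = zis.read(chunk)
    while (n != -1) { out.write(chunk, 0, n); n = zis.read(chunk) }
    (entry.getName, out.toByteArray)
  }.toList

entries.foreach { case (name, bytes) =>
  println(s"$name -> ${new String(bytes, "UTF-8")}")
}
```

      Each element of `entries` corresponds to one (Text, BytesWritable) record in the Spark example above.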
    • Limitation
      • It cannot process a zip archive in which an entry's uncompressed size exceeds 2 GB, since each entry is materialized as a single BytesWritable backed by a Java byte array (limited to Integer.MAX_VALUE bytes)
    • Reference
  • solr-hadoop-common
  • zipstream