Thursday, August 17, 2017

Reading Zip files from Spark

  • com-cotdp-hadoop
    • Clone the github repo here:
    • Edit the pom.xml file to add the HDP repositories
              <!-- HDP Repositories -->
                  <name>HDP Releases</name>
                  <name>HDP Public</name>
    • Use the command mvn package to build the jar
    • E.g.
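      import com.cotdp.hadoop.ZipFileInputFormat
      import org.apache.hadoop.io.{BytesWritable, Text}
      // Each record is (entry name, entry contents) for one file inside the zip
      val zipFileRDD = spark.sparkContext.newAPIHadoopFile("hdfs://", classOf[ZipFileInputFormat],
        classOf[Text], classOf[BytesWritable], spark.sparkContext.hadoopConfiguration)
      println("The file name in the zipFile is: " + zipFileRDD.map(s => s._1.toString).first())
      // getBytes returns the padded backing array, so decode only the first getLength bytes
      println("The file contents are: " + zipFileRDD.map(s => new String(s._2.getBytes, 0, s._2.getLength, "UTF-8")).first())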
    • Limitation
      • It cannot process a zip file whose unzipped file size is larger than 2 GB
    • Reference
  • solr-hadoop-common
  • zipstream
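
The zipstream entry above carries no details. One possible reading, and this is an assumption rather than something the notes confirm, is to skip the custom InputFormat and unpack each archive with the JDK's java.util.zip.ZipInputStream. A minimal sketch of that idea follows; ZipStreamSketch and readZipEntries are hypothetical names used only for illustration.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.util.zip.{ZipEntry, ZipInputStream}
import scala.collection.mutable.ArrayBuffer

object ZipStreamSketch {
  // Hypothetical helper: read every file entry of a zip stream
  // into (entry name, entry bytes) pairs. Directories are skipped.
  def readZipEntries(in: InputStream): Seq[(String, Array[Byte])] = {
    val zis = new ZipInputStream(in)
    val entries = ArrayBuffer.empty[(String, Array[Byte])]
    try {
      var entry: ZipEntry = zis.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // Copy the current entry's decompressed bytes into memory
          val out = new ByteArrayOutputStream()
          val buf = new Array[Byte](8192)
          var n = zis.read(buf)
          while (n != -1) {
            out.write(buf, 0, n)
            n = zis.read(buf)
          }
          entries += ((entry.getName, out.toByteArray))
        }
        entry = zis.getNextEntry
      }
    } finally {
      zis.close()
    }
    entries.toSeq
  }
}
```

In Spark this helper could be applied to the streams returned by binaryFiles, for example spark.sparkContext.binaryFiles("hdfs://").flatMap { case (_, stream) => ZipStreamSketch.readZipEntries(stream.open()) }. Because each entry is still held in a single byte array, this sketch does not lift the 2 GB limit either.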