Monday, August 31, 2015

Behemoth

  1. 2015.08.18
    1. prerequisites
      1. Java 1.6
      2. Apache Maven 2.2.1
      3. internet connection
    2. compiling
      1. git clone https://github.com/DigitalPebble/behemoth.git
      2. cd behemoth
      3. mvn install
      4. mvn test
      5. mvn package
    3. generate a corpus
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i <file or dir> -o output1
      2. ./behemoth importer
    4. extract text
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i output1 -o output2
      2. ./behemoth tika
    5. inspect the corpus
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusReader -i output2 -a -c -m -t
      2. hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00000
      3. hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00001
      4. ./behemoth reader
    6. extract content from sequence files
      1. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2 -o output3
      2. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00000 -o output4
      3. hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00001 -o output5
      4. ./behemoth exporter
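
The corpus steps above (generate, extract text, inspect, extract content) can be chained in one small script. This is only a rough sketch: the jar path, class names, and flags are copied from the list, while `run`, `pipeline`, and the `DRY_RUN` variable are hypothetical names made up for illustration. It assumes it is run from the behemoth checkout after the build, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: chain the corpus steps from the list above.
# Assumption: run from the behemoth checkout after the Maven build.
# run, pipeline and DRY_RUN are hypothetical helper names, not part of Behemoth.

JOB_JAR="tika/target/behemoth-tika-*-job.jar"
PKG="com.digitalpebble.behemoth"

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 is the input file or directory for the corpus generator.
pipeline() {
    input="$1"
    # JOB_JAR is left unquoted on purpose so the shell can expand the glob.
    run hadoop jar $JOB_JAR "$PKG.util.CorpusGenerator" -i "$input" -o output1
    run hadoop jar $JOB_JAR "$PKG.tika.TikaDriver" -i output1 -o output2
    run hadoop jar $JOB_JAR "$PKG.util.CorpusReader" -i output2 -a -c -m -t
    run hadoop jar $JOB_JAR "$PKG.util.ContentExtractor" -i output2 -o output3
}

# Run the pipeline only when an input path is given on the command line.
if [ "$#" -gt 0 ]; then
    pipeline "$1"
fi
```

Saved under any name, calling it with an input path prints the four hadoop commands in order; setting `DRY_RUN=0` in the environment makes it run them for real, which needs hadoop on the PATH and the built job jar.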
