- 2015.08.18
- prerequisites
  - java 1.6
  - apache maven 2.2.1
  - internet connection
- compiling
  - git clone https://github.com/DigitalPebble/behemoth.git
  - cd behemoth
  - mvn install
  - mvn test
  - mvn package
- generate a corpus
  - hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i <file or dir> -o output1
  - ./behemoth importer
- extract text
  - hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i output1 -o output2
  - ./behemoth tika
- inspect the corpus
  - hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.CorpusReader -i output2 -a -c -m -t
  - hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00000
  - hadoop fs -libjars tika/target/behemoth-tika-*-job.jar -text output2/part-00001
  - ./behemoth reader
- extract content from seq files
  - hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2 -o output3
  - hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00000 -o output4
  - hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.util.ContentExtractor -i output2/part-00001 -o output5
  - ./behemoth exporter
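
The hadoop commands above can be chained in one small script. This is only a sketch: the `run_step` helper and the `INPUT`, `JOB_JAR` and `DRY_RUN` variables are names made up for this example and are not part of Behemoth, while the class names and options are the ones from the notes above. By default it only prints the commands, so it can be tried without a Hadoop installation.

```shell
#!/bin/sh
# Sketch: corpus generation -> Tika extraction -> inspection -> content export.
# INPUT, JOB_JAR, DRY_RUN and run_step are hypothetical names for this example.
set -eu

INPUT="${INPUT:-docs}"                                    # file or dir to import (assumed default)
JOB_JAR="${JOB_JAR:-tika/target/behemoth-tika-*-job.jar}" # job jar pattern from the notes
DRY_RUN="${DRY_RUN:-1}"                                   # 1 = print only, 0 = execute

run_step() {
    # $1 is the fully qualified class name; the remaining arguments are passed through.
    class="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        # Quoted, so the wildcard in JOB_JAR is printed as-is and not expanded.
        echo "hadoop jar $JOB_JAR $class $*"
    else
        # JOB_JAR is left unquoted on purpose so the shell expands the wildcard.
        hadoop jar $JOB_JAR "$class" "$@"
    fi
}

run_step com.digitalpebble.behemoth.util.CorpusGenerator -i "$INPUT" -o output1
run_step com.digitalpebble.behemoth.tika.TikaDriver -i output1 -o output2
run_step com.digitalpebble.behemoth.util.CorpusReader -i output2 -a -c -m -t
run_step com.digitalpebble.behemoth.util.ContentExtractor -i output2 -o output3
```

Saved under a name of your choice (say `pipeline.sh`, also just an example name) and started from the behemoth directory with `DRY_RUN=0 INPUT=<file or dir> sh pipeline.sh`, it would run the four steps in order, assuming `hadoop` is on the PATH and the job jar has already been built with maven.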
Monday, August 31, 2015
Behemoth