I did a small project to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all the dependencies, so I will start with the POM file.
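    <project xmlns="http://maven.apache.org/POM/4.0.0">
        <modelVersion>4.0.0</modelVersion>

        <groupId>org.acme</groupId>
        <artifactId>mahout</artifactId>
        <version>0.94</version>
        <name>Mahout Examples</name>
        <description>Scalable machine learning library examples</description>
        <packaging>jar</packaging>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <apache.mahout.version>0.4</apache.mahout.version>
        </properties>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <encoding>UTF-8</encoding>
                        <source>1.6</source>
                        <!-- one more option was set to "true" here; its element name is not preserved -->
                    </configuration>
                </plugin>
            </plugins>
        </build>

        <dependencies>
            <dependency>
                <groupId>org.apache.mahout</groupId>
                <artifactId>mahout-core</artifactId>
                <version>${apache.mahout.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.mahout</groupId>
                <artifactId>mahout-math</artifactId>
                <version>${apache.mahout.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.mahout</groupId>
                <artifactId>mahout-utils</artifactId>
                <version>${apache.mahout.version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>1.6.0</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-jcl</artifactId>
                <version>1.6.0</version>
            </dependency>
        </dependencies>
    </project>

Then I looked into the Apache Mahout examples and the algorithms available for text classification. The simplest and most accurate one is the Naive Bayes classifier. Here is a code snippet: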
    package org.acme;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.classifier.ClassifierResult;
    import org.apache.mahout.classifier.bayes.TrainClassifier;
    import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
    import org.apache.mahout.classifier.bayes.common.BayesParameters;
    import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
    import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
    import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
    import org.apache.mahout.classifier.bayes.interfaces.Datastore;
    import org.apache.mahout.classifier.bayes.model.ClassifierContext;
    import org.apache.mahout.common.nlp.NGrams;

    public class Starter {
        public static void main( final String[] args ) {
            final BayesParameters params = new BayesParameters();
            params.setGramSize( 1 );
            params.set( "verbose", "true" );
            params.set( "classifierType", "bayes" );
            params.set( "defaultCat", "OTHER" );
            params.set( "encoding", "UTF-8" );
            params.set( "alpha_i", "1.0" );
            params.set( "dataSource", "hdfs" );
            params.set( "basePath", "/tmp/output" );

            BufferedReader reader = null;
            try {
                // Train the model on the labelled files in /tmp/input
                Path input = new Path( "/tmp/input" );
                TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

                // Load the trained model into memory
                Algorithm algorithm = new BayesAlgorithm();
                Datastore datastore = new InMemoryBayesDatastore( params );
                ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
                classifier.initialize();

                // args[ 0 ] is the path to a text file; every line of it is classified separately
                reader = new BufferedReader( new FileReader( args[ 0 ] ) );
                String entry = reader.readLine();
                while( entry != null ) {
                    List< String > document =
                        new NGrams( entry, Integer.parseInt( params.get( "gramSize" ) ) )
                            .generateNGramsWithoutLabel();

                    ClassifierResult result = classifier.classifyDocument(
                        document.toArray( new String[ document.size() ] ),
                        params.get( "defaultCat" ) );

                    // Print the predicted category next to the line it belongs to
                    System.out.println( result.getLabel() + "\t" + entry );

                    entry = reader.readLine();
                }
            } catch( final IOException ex ) {
                ex.printStackTrace();
            } catch( final InvalidDatastoreException ex ) {
                ex.printStackTrace();
            } finally {
                // Close the input file even if classification failed
                if( reader != null ) {
                    try {
                        reader.close();
                    } catch( final IOException ex ) {
                        ex.printStackTrace();
                    }
                }
            }
        }
    }

There is one important note here: the system must be trained before it can classify anything. To do so, you need to provide example texts for each category (the more, the better). These should be plain text files where each line starts with the category, separated from the text itself by a tab. For example:
    SUGGESTION	That's a great suggestion
    QUESTION	Do you sell Microsoft Office?
    ...

The more files you provide, the more precise the classification will be. All files must be put into the '/tmp/input' folder; they will be processed by Apache Hadoop first. :)
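Since a line without a tab is easy to miss in a large training file, it may be worth checking the files before copying them to '/tmp/input'. The following is only a minimal sketch in plain Java: the class `TrainingLineCheck` and its methods are hypothetical helpers, not part of Mahout, and they assume exactly the layout described above (category, one tab, then the text).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Mahout class): checks "CATEGORY<TAB>text" training lines.
public class TrainingLineCheck {

    /** Returns the category of a training line, or null if the line is malformed. */
    public static String categoryOf(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            return null; // no tab at all, or nothing before it
        }
        String category = line.substring(0, tab).trim();
        String text = line.substring(tab + 1).trim();
        // Both the category and the text must be non-empty
        if (category.isEmpty() || text.isEmpty()) {
            return null;
        }
        return category;
    }

    /** Returns the 1-based numbers of all lines that do not match the expected layout. */
    public static List<Integer> malformedLines(List<String> lines) {
        List<Integer> bad = new ArrayList<Integer>();
        for (int i = 0; i < lines.size(); i++) {
            if (categoryOf(lines.get(i)) == null) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "SUGGESTION\tThat's a great suggestion",
            "QUESTION Do you sell Microsoft Office?"); // space instead of tab
        System.out.println("Malformed lines: " + malformedLines(sample));
    }
}
```

In a real project the lines would come from the training files themselves rather than from a hard-coded list; the check is the same.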
4 comments:
Hi.. Interesting experiment.. but what is args[0] in ur code?
Also how is it to be run? The dependencies are installed using mvn install..
Hello,
do you have an example that uses Apache Mahout? Many thanks.
Please, an example of how to use Mahout.