Tuesday, November 30, 2010

Getting started with Apache Mahout

Recently I have got an interesting problem to solve: how to classify text from different sources using automation? Some time ago I read about a project which does this as well as many other text analysis stuff - Apache Mahout. Though it's not a very mature one (current version is 0.4), it's very powerful and scalable. Build on top of another excellent project, Apache Hadoop, it's capable to analyze huge data sets.

So I did a small project in order to understand how Apache Mahout works. I decided to use Apache Maven 2 in order to manage all dependencies so I will start with POM file first.


  4.0.0
  org.acme
  mahout
  0.94
  Mahout Examples
  Scalable machine learning library examples
  jar

  
    UTF-8
    0.4
  
 
  
    
      
        org.apache.maven.plugins
        maven-compiler-plugin
        
          UTF-8
          1.6
          1.6
          true
        
      
    
  

  
    
      org.apache.mahout
      mahout-core
      ${apache.mahout.version}
    

    
      org.apache.mahout
      mahout-math
      ${apache.mahout.version}
    

    
      org.apache.mahout
      mahout-utils
      ${apache.mahout.version}
    


     
      org.slf4j
      slf4j-api
      1.6.0
    

    
      org.slf4j
      slf4j-jcl
      1.6.0
    
  

Then I looked into Apache Mahout examples and algorithms available for text classification problem. The most simple and accurate one is Naive Bayes classifier. Here is a code snippet:
package org.acme;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.FileReader;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class Starter {
 public static void main( final String[] args ) {
  final BayesParameters params = new BayesParameters();
  params.setGramSize( 1 );
  params.set( "verbose", "true" );
  params.set( "classifierType", "bayes" );
  params.set( "defaultCat", "OTHER" );
  params.set( "encoding", "UTF-8" );
  params.set( "alpha_i", "1.0" );
  params.set( "dataSource", "hdfs" );
  params.set( "basePath", "/tmp/output" );
  
  try {
      Path input = new Path( "/tmp/input" );
      TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );
   
      Algorithm algorithm = new BayesAlgorithm();
      Datastore datastore = new InMemoryBayesDatastore( params );
      ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
      classifier.initialize();
      
      final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
      String entry = reader.readLine();
      
      while( entry != null ) {
          List< String > document = new NGrams( entry, 
                          Integer.parseInt( params.get( "gramSize" ) ) )
                          .generateNGramsWithoutLabel();

          ClassifierResult result = classifier.classifyDocument( 
                           document.toArray( new String[ document.size() ] ), 
                           params.get( "defaultCat" ) );          

          entry = reader.readLine();
      }
  } catch( final IOException ex ) {
   ex.printStackTrace();
  } catch( final InvalidDatastoreException ex ) {
   ex.printStackTrace();
  }
 }
}
There is one important note here: system must be taught before starting classification. In order to do so, it's necessary to provide examples (more - better) of different text classification. It should be simple files where each line starts with category separated by tab from text itself. F.e.:
SUGGESTION  That's a great suggestion
QUESTION  Do you sell Microsoft Office?
...
More files you can provide, more precise classification you will get. All files must be put to the '/tmp/input' folder, they will be processed by Apache Hadoop first. :)

4 comments:

Tipsy said...

Hi.. Interesting experiment.. but what is args[0] in ur code?

Also how is it to be run? The dependencies are installed using mvn install..

jyotirmoy said...
This comment has been removed by the author.
sergio gomez said...

hola,
tienes algun ejemplo utilizando apache mahout muchas gracias

sergio gomez said...

por favor un ejemplo sobre como utilizar mahout