CommonCrawl Detected MRI


Text-based Maori Language Detection

The code, its helper files and libraries, and this README live at:

    http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

You can check it out from svn with:

    svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

This checkout contains:

1. Once you have checked out maori-lang-detection from gs3-extensions with svn, create a folder called "models" and put the files langdetect-183.bin and mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts" into it. (These are just zip files, but they must keep the .bin extension for OpenNLP to use them. If you ever wish to see the contents of such a .bin file, you can rename it to .zip and open it with the Archive Manager or another zip tool.)

2. Next, extract apache-opennlp-1.9.1-bin.tar.gz. This creates a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.
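
Steps 1 and 2 can be sketched as a small shell function. This is only an illustration: the function name setup_opennlp_models is hypothetical, and it ASSUMES that apache-opennlp-1.9.1-bin.tar.gz sits at the top level of the checkout and that "models" is created there too. Adjust the paths if your layout differs.

```shell
# Hypothetical helper sketching steps 1 and 2.
# $1 = path to the checked-out maori-lang-detection folder.
# ASSUMPTION: the OpenNLP tarball is at the top level of that folder.
setup_opennlp_models() {
    checkout="$1"
    src="$checkout/models-trainingdata-and-sampletxts"

    # Step 1: create "models" and copy in both model files, keeping the .bin extension.
    mkdir -p "$checkout/models" || return 1
    cp "$src/langdetect-183.bin" "$src/mri-sent_trained.bin" "$checkout/models/" || return 1

    # Step 2: extract OpenNLP, then move "models" into the extracted folder.
    tar -xzf "$checkout/apache-opennlp-1.9.1-bin.tar.gz" -C "$checkout" || return 1
    mv "$checkout/models" "$checkout/apache-opennlp-1.9.1/" || return 1
}
```

Call it with the path to your checkout, for example:

	   setup_opennlp_models /type/here/path/to/your/maori-lang-detection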

3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:

	   cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
	   export OPENNLP_HOME=`pwd`
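
Because the later commands silently depend on this environment, it can help to confirm it before compiling or running. The function below is a minimal sketch, not part of the project: the name check_opennlp_home is hypothetical, and it only checks for the jar named in the compile command and for the "models" folder moved in step 2.

```shell
# Hypothetical sanity check for the terminal set up in step 3.
# Returns 0 when OPENNLP_HOME points at a folder holding the OpenNLP tools jar
# and the "models" folder; otherwise reports what is missing and returns 1.
check_opennlp_home() {
    if [ -z "${OPENNLP_HOME:-}" ]; then
        echo "OPENNLP_HOME is not set" >&2
        return 1
    fi
    if [ ! -f "$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" ]; then
        echo "missing $OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" >&2
        return 1
    fi
    if [ ! -d "$OPENNLP_HOME/models" ]; then
        echo "missing $OPENNLP_HOME/models" >&2
        return 1
    fi
    echo "OPENNLP_HOME looks usable: $OPENNLP_HOME"
}
```

Run check_opennlp_home in the same terminal straight after the export. A non-zero return means one of the earlier steps still needs attention.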

4. If you want to recompile, change into the "src" subfolder of the checked-out maori-lang-detection folder. Make sure the JDK 7+ bin folder is on your PATH environment variable. Still in the SAME terminal in which you set up the OPENNLP_HOME environment in step 3, run:

	   maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/MaoriTextDetector.java

5. To run the MaoriTextDetector program, you need the JDK or JRE 7+ bin folder on your PATH. Still in the SAME terminal in which you set up the OPENNLP_HOME environment in step 3, type one of the following:

	   maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector --help
	   (prints the usage, including other options)                  
	 
	   maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector --file <full/path/to/textfile>
	 
	   maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector -
	   (expects text to stream in from standard input.
	   If entering text manually, press Ctrl-D to indicate the end of standard input.)
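
The standard-input form is convenient for piping text in from another command. The wrapper below is a sketch only: detect_from_stdin is a hypothetical name, and it assumes you call it from maori-lang-detection/src in a terminal prepared as in step 3. It does nothing beyond running the "-" command shown above.

```shell
# Hypothetical wrapper around the "-" (standard input) form of the command.
# ASSUMPTIONS: the current directory is maori-lang-detection/src and
# OPENNLP_HOME was exported as in step 3; the shell aborts with a message if not.
detect_from_stdin() {
    # The quotes keep lib/* literal so that java, not the shell, expands the wildcard.
    java -cp ".:${OPENNLP_HOME:?set OPENNLP_HOME first (step 3)}/lib/*" \
        org.greenstone.atea.MaoriTextDetector -
}
```

For example, either of these feeds text to the detector without typing it by hand:

	   maori-lang-detection/src$ printf '%s\n' "Kia ora koutou katoa" | detect_from_stdin
	   maori-lang-detection/src$ detect_from_stdin < <full/path/to/textfile>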