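Installing TreeTagger and doing English morphological analysis with Python 2.7

To do morphological analysis (part-of-speech tagging and lemmatization) of English text, I installed TreeTagger.

NLTK and Stanford NLP are apparently options as well, but I picked TreeTagger more or less on a whim.

Installing TreeTagger

This example installs into /usr/local/src. Download the required files from the official site and run the installer.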

$ cd /usr/local/src
$ mkdir tree-tagger
$ cd tree-tagger/
# Download the files needed to install TreeTagger
$ wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.tar.gz
$ wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
$ wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/english-par-linux-3.2-utf8.bin.gz
# Download the installer shell script
$ wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/install-tagger.sh
# Make the script executable and run it
$ chmod u+x install-tagger.sh
$ ./install-tagger.sh

This is optional, but I also added the following to .bashrc so that the TreeTagger commands are on the PATH.

$ vi ~/.bashrc
    # Append the following lines
    PATH="$PATH":/usr/local/src/tree-tagger/cmd
    PATH="$PATH":/usr/local/src/tree-tagger/bin
$ bash

If the following command runs, the installation worked.

$ echo 'Hello world!' | tree-tagger-english
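
If that command is not found, the PATH additions probably did not take effect. As a rough check from Python, the sketch below looks for the commands on PATH using only the standard library. The helper name find_on_path is my own and is not part of TreeTagger; the command names are the ones installed into the cmd and bin directories above.

```python
import os


def find_on_path(command, path=None):
    """Return the full path of an executable named `command` found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            # Skip empty PATH entries instead of treating them as the current directory.
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # tree-tagger-english lives in cmd/, tree-tagger in bin/
    for name in ("tree-tagger-english", "tree-tagger"):
        print("%s -> %s" % (name, find_on_path(name)))
```

If either name prints None, open a new shell (or re-read .bashrc) and try again.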

Using TreeTagger from Python

There is a community-written Python wrapper for TreeTagger, so I use that.

$ pip install treetaggerwrapper

Once the pip install succeeds, check that TreeTagger can be used from Python.

$ echo "This is the sentence." | python -m treetaggerwrapper --pipe --debug

  ERROR Failed to find TreeTagger from automatic directories list.
  ERROR If you installed TreeTagger in a standard place, please contact the treetaggerwrapper author to add this place to this list.
  ERROR To continue working, setup TAGDIR env var to TreeTagger directory.
  ERROR Can't locate TreeTagger directory (and no TAGDIR specified).

It says it could not find the directory where TreeTagger is installed.
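
The error message itself suggests one way out: set the TAGDIR environment variable. I have not verified this route, but a minimal sketch would look like the following. The helper name ensure_tagdir is hypothetical, and the default path is simply the install location used in this post.

```python
import os

# Assumed install location from the steps in this post; adjust to your setup.
DEFAULT_TAGDIR = "/usr/local/src/tree-tagger"


def ensure_tagdir(default=DEFAULT_TAGDIR, environ=None):
    """Set TAGDIR if it is not already set, and return the value in effect."""
    if environ is None:
        environ = os.environ
    # setdefault keeps an existing TAGDIR and only fills in a missing one.
    return environ.setdefault("TAGDIR", default)


if __name__ == "__main__":
    print("TAGDIR = %s" % ensure_tagdir())
```

The idea is to call ensure_tagdir() before creating the tagger. Writing export TAGDIR=/usr/local/src/tree-tagger in the shell should have the same effect for the command-line check.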

According to the treetaggerwrapper documentation, you can pass TAGDIR as an argument when creating the TreeTagger object, so I wrote the following example.py and ran it.

import treetaggerwrapper
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en',TAGDIR='/usr/local/src/tree-tagger')
tags = tagger.TagText("This is the sentence.")
for tag in tags:
    print tag

Run the script.

$ python example.py

This raises the following error. Apparently the text has to be Unicode.

treetaggerwrapper.TreeTaggerError: Must use *unicode* string as text to tag.

In Python 3 string literals are Unicode by default, but in Python 2 you have to add the u prefix, as in u"string".

So change the TagText call in the script as follows.

tags = tagger.TagText(u"This is the sentence.")
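
The u prefix only helps with literals. For text read from a file or stdin in Python 2, you would need to decode the bytes yourself. Here is a small sketch, assuming the input is UTF-8; the helper name to_text is my own.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding it first if it is a byte string."""
    # In Python 2, `bytes` is an alias for `str`, so plain "..." literals are decoded here.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


if __name__ == "__main__":
    print(repr(to_text(b"This is the sentence.")))
```

With this, tagger.TagText(to_text(raw)) should accept either kind of string.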

Then run example.py again.

$ python example.py
  This      DT   this
  is        VBZ  be
  the       DT   the
  sentence  NN   sentence

The morphological analysis worked.
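
To do something with the result beyond printing it, each line can be split into its word, part-of-speech tag, and lemma. The sketch below assumes each item returned by TagText is a tab-separated string of those three fields, which is what the output above suggests. Token and parse_tag_lines are names I made up for illustration.

```python
from collections import namedtuple

# One tagged token: surface form, part-of-speech tag, lemma.
Token = namedtuple("Token", ["word", "pos", "lemma"])


def parse_tag_lines(lines):
    """Turn tab-separated 'word<TAB>pos<TAB>lemma' strings into Token tuples."""
    tokens = []
    for line in lines:
        fields = line.split("\t")
        # Skip anything that does not have exactly three fields.
        if len(fields) == 3:
            tokens.append(Token(*fields))
    return tokens


if __name__ == "__main__":
    # Sample lines written by hand to mirror the output above.
    sample = [u"This\tDT\tthis", u"is\tVBZ\tbe", u"the\tDT\tthe", u"sentence\tNN\tsentence"]
    tokens = parse_tag_lines(sample)
    # Pick out the lemmas of nouns (tags starting with NN).
    print([t.lemma for t in tokens if t.pos.startswith("NN")])
```

In example.py this would be called as parse_tag_lines(tags) on the list returned by TagText.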

TreeTagger official site