YoJl's Pocket

A record of what I have studied and learned.

On Word Segmentation

Natural language processing broadly proceeds through the following steps.

[Figure: The flow of natural language processing]

In this post, I will work on the "word segmentation" step of that flow.

1. Differences between Japanese and English

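Languages are often classified by morphological type, for example as agglutinative languages (膠着語) or fusional (inflectional) languages (屈折語).
In an agglutinative language, words are built by attaching morphemes (the smallest units of meaning, roughly "words") one after another.
Japanese and Finnish are typical examples.
In a fusional language, the word form itself changes to express grammatical roles.
German is a typical example, and English is often grouped here as well.
For word segmentation, however, the difference that matters is how the language is written: Japanese text has no spaces between words, whereas English text separates words with spaces. (Finnish, although agglutinative, is also written with spaces.)

Text whose words run together like this is hard for a computer to process, so we first split Japanese text into separate words, giving it a form similar to English.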
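As a small illustration of why this step is needed, whitespace splitting already works as a rough tokenizer for an English sentence but returns a Japanese sentence as one unsplit chunk. The two sentences below are my own examples, not taken from any dataset.

```python
# Whitespace splitting is a rough tokenizer for English,
# but Japanese is written without spaces, so nothing gets split.
english = "Remote work is recommended."
japanese = "リモートワークが推奨されている。"

english_tokens = english.split()
japanese_tokens = japanese.split()

print(english_tokens)   # ['Remote', 'work', 'is', 'recommended.']
print(japanese_tokens)  # ['リモートワークが推奨されている。']
```

This is the gap that a morphological analyzer such as MeCab fills.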

2. MeCab

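MeCab is a morphological analyzer: it splits a sentence into morphemes (word segmentation). For how it decides where to split, see the article "日本語形態素解析の裏側を覗く!MeCab はどのように形態素解析しているか" ("A look behind the scenes of Japanese morphological analysis: how MeCab analyzes text").
In short, MeCab first uses a dictionary to enumerate every possible way of splitting the sentence.
It then uses costs estimated from a corpus to work out which split is the most plausible, and outputs that split as the analysis result.
Unknown words are apparently delimited by character type, such as katakana or the Latin alphabet.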
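The idea of "enumerate every split, then pick the most plausible one" can be sketched with a toy example. This is a simplified illustration, not MeCab's actual implementation: the dictionary `TOY_DICT`, its costs, and `UNKNOWN_COST` are made up, and only per-word costs are used, whereas MeCab also uses connection costs between adjacent morphemes. The search itself is a standard minimum-cost path (Viterbi-style dynamic programming) over the lattice of candidate words.

```python
# Toy dictionary: surface form -> word cost (lower = more plausible).
# Entries and costs are invented for illustration only.
TOY_DICT = {
    "東": 5,
    "東京": 3,
    "京都": 3,
    "都": 4,
    "東京都": 4,
    "に": 1,
    "住む": 2,
}
UNKNOWN_COST = 10  # hypothetical penalty for a character not in the dictionary


def segment(text, dictionary=TOY_DICT):
    """Return the minimum-cost segmentation of text as a list of words."""
    n = len(text)
    # best[i] = (total cost, start of last word) for the cheapest path ending at i
    best = [(float("inf"), -1)] * (n + 1)
    best[0] = (0, -1)
    for start in range(n):
        base_cost = best[start][0]
        if base_cost == float("inf"):
            continue
        # Candidate words: every dictionary entry that begins at `start`.
        candidates = [
            (text[start:end], dictionary[text[start:end]])
            for end in range(start + 1, n + 1)
            if text[start:end] in dictionary
        ]
        # Fallback so that an unknown character never blocks the path.
        if text[start] not in dictionary:
            candidates.append((text[start], UNKNOWN_COST))
        for word, cost in candidates:
            end = start + len(word)
            if base_cost + cost < best[end][0]:
                best[end] = (base_cost + cost, start)
    # Follow the back-pointers from the end to recover the words.
    words, pos = [], n
    while pos > 0:
        prev = best[pos][1]
        words.append(text[prev:pos])
        pos = prev
    return words[::-1]


print(segment("東京都に住む"))  # ['東京都', 'に', '住む']
```

Here "東京都" could also be split as 東京 / 都 or 東 / 京都, but the single entry 東京都 has the lowest total cost, so it wins. In a real analyzer these costs come from training on an annotated corpus rather than being written by hand.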

3. Using MeCab

I use Google Colaboratory as the development environment.
First, install the packages needed to use MeCab.

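# Colab cell: lines starting with "!" run as shell commands. Install aptitude and SWIG.
!apt install aptitude swig
# Install MeCab, its development headers, the UTF-8 IPA dictionary, and the tools used later to build NEologd
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
# Install the Python binding for MeCab
!pip install mecab-python3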

Out [1] (long installation log, abridged to its key lines)

The following NEW packages will be installed: file libmagic-mgc{a} libmagic1{a} libmecab-dev libmecab2{a} mecab mecab-ipadic{a} mecab-ipadic-utf8 mecab-jumandic{a} mecab-jumandic-utf8{a} mecab-utils{a}
...
Setting up mecab-ipadic-utf8 (2.7.0-20070801+main-1) ...
...
Setting up mecab (0.996-5) ...
...
Successfully installed mecab-python3-1.0.3

Note: you can safely skip this log.

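Once the installation has finished, run the following.

import MeCab

# Sample sentence: "Due to the spread of the novel coronavirus, remote work is being recommended."
sample_txt = "新型コロナウイルスの感染拡大によって、リモートワークが推奨されている。"
# Create a tagger that uses the default system dictionary
m = MeCab.Tagger()
# parse() returns one morpheme per line, followed by a final "EOS" line
print("Mecab:\n", m.parse(sample_txt))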

Out [2]

Mecab:
新型 名詞,一般,*,*,*,*,新型,シンガタ,シンガタ
コロナ 名詞,一般,*,*,*,*,コロナ,コロナ,コロナ
ウイルス 名詞,一般,*,*,*,*,ウイルス,ウイルス,ウイルス
の 助詞,連体化,*,*,*,*,の,ノ,ノ
感染 名詞,サ変接続,*,*,*,*,感染,カンセン,カンセン
拡大 名詞,サ変接続,*,*,*,*,拡大,カクダイ,カクダイ
によって 助詞,格助詞,連語,*,*,*,によって,ニヨッテ,ニヨッテ
、 記号,読点,*,*,*,*,、,、,、
リモート 名詞,一般,*,*,*,*,リモート,リモート,リモート
ワーク 名詞,一般,*,*,*,*,ワーク,ワーク,ワーク
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
推奨 名詞,サ変接続,*,*,*,*,推奨,スイショウ,スイショー
さ 動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ
れ 動詞,接尾,*,*,一段,連用形,れる,レ,レ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
いる 動詞,非自立,*,*,一段,基本形,いる,イル,イル
。 記号,句点,*,*,*,*,。,。,。
EOS
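To work with this result in code, it helps to turn the text into a list of tokens. The following is a minimal sketch, assuming MeCab's default output format in which the surface form and the comma-separated feature string are separated by a tab (the listing above shows that tab as a space). The function name `parse_mecab_output` is my own, and the sample string is copied from two lines of the output above so that the snippet runs without MeCab installed.

```python
def parse_mecab_output(raw):
    """Turn MeCab's default text output into a list of (surface, features) pairs."""
    tokens = []
    for line in raw.splitlines():
        # Skip the end-of-sentence marker and any empty lines.
        if line == "EOS" or not line:
            continue
        # Surface form and feature string are separated by a tab.
        surface, feature_str = line.split("\t", 1)
        # Naive split: fine for this sample, but a token whose features
        # contain an ASCII comma would need proper CSV parsing.
        tokens.append((surface, feature_str.split(",")))
    return tokens


# Two lines copied from the output above (with the tab restored), plus EOS.
sample = (
    "新型\t名詞,一般,*,*,*,*,新型,シンガタ,シンガタ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS\n"
)
tokens = parse_mecab_output(sample)
for surface, features in tokens:
    # features[0] is the top-level part of speech (名詞 = noun, 助詞 = particle).
    print(surface, features[0])
```

With a real tagger, the same function could be applied to the string returned by `m.parse(sample_txt)`.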

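4. Introducing NEologd

The result above shows that new words such as "新型コロナウイルス" (novel coronavirus) and "リモートワーク" (remote work) are not recognized as single words: they are split into 新型 / コロナ / ウイルス and リモート / ワーク.
To handle new words as well, we introduce a system dictionary called "NEologd".
Details are available on GitHub.

First, install NEologd.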

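# Fetch only the latest revision of the NEologd repository
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
# Build and install the dictionary; "echo yes" answers the installer's confirmation prompt
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a
# Make the system mecabrc available at /usr/local/etc/mecabrc as well
!ln -s /etc/mecabrc /usr/local/etc/mecabrc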

Out [3] (long build log, abridged to its key lines)

Cloning into 'mecab-ipadic-neologd'...
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd
[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Re-Index system dictionary
...
16866 reading ./Noun.others.csv ... 153 reading ./Noun.demonst.csv ... 120 reading ./neologd-adjective-std-dict-seed.20151126.csv ... 507812 reading ./Adj.csv ... 27210 reading ./Prefix.csv ... 224 reading ./Adnominal.csv ... 135 reading ./Noun.csv ... 60734 reading ./Auxil.csv ... 199 reading ./neologd-adjective-verb-dict-seed.20160324.csv ... 20268 reading ./Suffix.csv ... 1448 reading ./Symbol.csv ... 208 reading ./mecab-user-dict-seed.20200910.csv ... 3224584 reading ./Noun.name.csv ... 34215 reading ./neologd-interjection-dict-seed.20170216.csv ... 4701 reading ./neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv ... 26058 reading ./Conjunction.csv ... 171 emitting double-array: 100% |###########################################| reading ./matrix.def ... 1316x1316 emitting matrix : 100% |###########################################|

done! [make-mecab-ipadic-NEologd] : Make custom system dictionary on /content/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801-neologd-20200910 make: Nothing to be done for 'all'. [make-mecab-ipadic-NEologd] : Finish.. [install-mecab-ipadic-NEologd] : Get results of tokenize test [test-mecab-ipadic-NEologd] : Start.. [test-mecab-ipadic-NEologd] : Replace timestamp from 'git clone' date to 'git commit' date [test-mecab-ipadic-NEologd] : Get buzz phrases % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 31914 0 31914 0 0 24473 0 --:--:-- 0:00:01 --:--:-- 24473 [test-mecab-ipadic-NEologd] : Get difference between default system dictionary and mecab-ipadic-NEologd [test-mecab-ipadic-NEologd] : Something wrong. You shouldn't install mecab-ipadic-NEologd yet. [test-mecab-ipadic-NEologd] : Finish..

[install-mecab-ipadic-NEologd] : Please check the list of differences in the upper part.

[install-mecab-ipadic-NEologd] : Do you want to install mecab-ipadic-NEologd? Type yes or no. [install-mecab-ipadic-NEologd] : OK. Let's install mecab-ipadic-NEologd. [install-mecab-ipadic-NEologd] : Start.. [install-mecab-ipadic-NEologd] : /usr/lib/x86_64-linux-gnu/mecab/dic isn't current user's directory [install-mecab-ipadic-NEologd] : Sudo make install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd make[1]: Entering directory '/content/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20200910' make[1]: Nothing to be done for 'install-exec-am'. /bin/bash ./mkinstalldirs /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd mkdir /usr/lib/x86_64-linux-gnu/mecab mkdir /usr/lib/x86_64-linux-gnu/mecab/dic mkdir /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd /usr/bin/install -c -m 644 ./matrix.bin /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/matrix.bin /usr/bin/install -c -m 644 ./char.bin /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/char.bin /usr/bin/install -c -m 644 ./sys.dic /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/sys.dic /usr/bin/install -c -m 644 ./unk.dic /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/unk.dic /usr/bin/install -c -m 644 ./left-id.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/left-id.def /usr/bin/install -c -m 644 ./right-id.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/right-id.def /usr/bin/install -c -m 644 ./rewrite.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/rewrite.def /usr/bin/install -c -m 644 ./pos-id.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/pos-id.def /usr/bin/install -c -m 644 ./dicrc /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/dicrc make[1]: Leaving directory '/content/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20200910'

[install-mecab-ipadic-NEologd] : Install completed. [install-mecab-ipadic-NEologd] : When you use MeCab, you can set '/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd' as a value of '-d' option of MeCab. [install-mecab-ipadic-NEologd] : Usage of mecab-ipadic-NEologd is here. Usage: $ mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd ...

[install-mecab-ipadic-NEologd] : Finish.. [install-mecab-ipadic-NEologd] : Finish..

Note: you probably don't need to read the output above.

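import MeCab
import subprocess

# Ask mecab-config for the dictionary directory, then append the NEologd folder.
# strip() drops the trailing newline that the command prints.
dicdir = subprocess.run(["mecab-config", "--dicdir"], stdout=subprocess.PIPE,
                        check=True).stdout.decode("utf-8").strip()
path = dicdir + "/mecab-ipadic-neologd"

sample_txt = "新型コロナウイルスの感染拡大によって、リモートワークが推奨されている。"

# Default system dictionary
m = MeCab.Tagger()
print("Mecab:\n", m.parse(sample_txt))

# mecab-ipadic-NEologd dictionary, selected with the -d option
m = MeCab.Tagger("-d {0}".format(path))
print("Mecab ipadic NEologd:\n", m.parse(sample_txt))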

Out[4]

Mecab ipadic NEologd:
新型コロナウイルス 名詞,固有名詞,一般,*,*,*,新型コロナウイルス,シンガタコロナウイルス,シンガタコロナウイルス
の 助詞,連体化,*,*,*,*,の,ノ,ノ
感染拡大 名詞,固有名詞,一般,*,*,*,感染拡大,カンセンカクダイ,カンセンカクダイ
によって 助詞,格助詞,連語,*,*,*,によって,ニヨッテ,ニヨッテ
、 記号,読点,*,*,*,*,、,、,、
リモートワーク 名詞,固有名詞,一般,*,*,*,リモートワーク,リモートワーク,リモートワーク
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
推奨 名詞,サ変接続,*,*,*,*,推奨,スイショウ,スイショー
さ 動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ
れ 動詞,接尾,*,*,一段,連用形,れる,レ,レ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
いる 動詞,非自立,*,*,一段,基本形,いる,イル,イル
。 記号,句点,*,*,*,*,。,。,。
EOS

「新型コロナウイルス」 and 「リモートワーク」 were not split; each was recognized as a single word.
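To work with this output in code, it helps to turn each line into a surface form plus a list of features. The following is a minimal sketch, assuming MeCab's default output format (surface form, a tab, then comma-separated features, terminated by `EOS`). The function name `parse_mecab_output` and the shortened sample string are my own for illustration; they are not part of MeCab.

```python
from typing import List, Tuple


def parse_mecab_output(text: str) -> List[Tuple[str, List[str]]]:
    """Split MeCab's default output into (surface, feature list) pairs."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        # Surface form first, then whitespace (a tab in raw MeCab output),
        # then the comma-separated features.
        surface, features = line.split(None, 1)
        tokens.append((surface, features.split(",")))
    return tokens


# Hypothetical sample: the first three lines of the NEologd output above.
sample_output = (
    "新型コロナウイルス\t名詞,固有名詞,一般,*,*,*,新型コロナウイルス,シンガタコロナウイルス,シンガタコロナウイルス\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "感染拡大\t名詞,固有名詞,一般,*,*,*,感染拡大,カンセンカクダイ,カンセンカクダイ\n"
    "EOS\n"
)

tokens = parse_mecab_output(sample_output)
# The first feature is the part of speech; keep only nouns (名詞).
nouns = [surface for surface, feats in tokens if feats[0] == "名詞"]
print(nouns)  # ['新型コロナウイルス', '感染拡大']
```

This sketch does not handle tokens whose surface form is itself whitespace or an ASCII comma, so treat it as a starting point rather than a complete parser.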

5. SentencePiece

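Another tool for splitting text is SentencePiece. Strictly speaking, it is a subword tokenizer learned from data rather than a dictionary-based morphological analyzer.
For details on SentencePiece, see SentencePieceについて書いてみる。
Put simply, it starts from a vocabulary of candidate words, splits the corpus into those words, and counts how often each one appears.
Next, words that appear frequently are treated as a single token, while words that appear rarely are split into shorter pieces.
It keeps splitting until the number of tokens in the vocabulary reaches a size specified in advance. This is what makes it possible to eliminate unknown words.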
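The idea of "frequent words stay whole, rare words are split into shorter pieces" can be sketched with a toy example. This is not SentencePiece's actual algorithm (the real library trains a BPE or unigram language model); it is a simplified illustration with a made-up corpus, a hypothetical `min_freq` threshold, and hypothetical function names.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(corpus_words: List[str], min_freq: int) -> Dict[str, int]:
    """Keep words seen at least min_freq times as single vocabulary entries."""
    counts = Counter(corpus_words)
    return {word: count for word, count in counts.items() if count >= min_freq}


def segment(word: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match segmentation.

    A word in the vocabulary stays whole. Otherwise it is broken into the
    longest known pieces, falling back to single characters, so no input
    ever becomes an unknown token.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept a known piece, or a single character as the last resort.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Made-up toy corpus, already split into words.
corpus = ["感染", "拡大", "感染", "拡大", "感染", "感染拡大", "推奨"]
vocab = build_vocab(corpus, min_freq=2)

print(segment("感染", vocab))      # ['感染']  frequent word, kept whole
print(segment("感染拡大", vocab))  # ['感染', '拡大']  rare word, split into known pieces
print(segment("推奨", vocab))      # ['推', '奨']  rare word, split into characters
```

The character-level fallback is what removes unknown words in this toy: any string can always be expressed with pieces the vocabulary already covers.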

Implementing SentencePiece requires a large corpus, so I will do that some other time...
If you want to try it right away, I recommend SentencePieceを使ってみた。, written by the same author as SentencePieceについて書いてみる。 mentioned above.

6. JUMAN++

Another morphological analyzer is JUMAN++.
Details on JUMAN++ are published at 日本語形態素解析システム JUMAN++.
It is said to perform better than MeCab. For the setup, I followed ColabでJUMAN++を使う.

7. Implementing JUMAN++

First, download JUMAN++ and build it from source.

!wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc2/jumanpp-2.0.0-rc2.tar.xz
!tar xfv jumanpp-2.0.0-rc2.tar.xz  
%cd jumanpp-2.0.0-rc2
!mkdir bld
%cd bld
!cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local
!make install -j2

Out[1] Omitted because it is far too long.

Compared with MeCab, the installation takes considerably longer to finish.

# Check that it works (command line)
!echo "新型コロナウイルスの感染拡大によって、リモートワークが推奨されている。" | jumanpp

Out[2]

新型 しんがた 新型 名詞 6 普通名詞 1 * 0 * 0 "代表表記:新型/しんがた カテゴリ:抽象物"
コロナ ころな コロナ 名詞 6 普通名詞 1 * 0 * 0 "代表表記:コロナ/ころな ドメイン:科学・技術 カテゴリ:自然物"
ウイルス ういるす ウイルス 名詞 6 普通名詞 1 * 0 * 0 "代表表記:ウイルス/ういるす ドメイン:健康・医学 カテゴリ:動物"
の の の 助詞 9 接続助詞 3 * 0 * 0 NIL
感染 かんせん 感染 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:感染/かんせん ドメイン:健康・医学 カテゴリ:抽象物"
拡大 かくだい 拡大 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:拡大/かくだい カテゴリ:抽象物 反義:名詞-サ変名詞:縮小/しゅくしょう"
に に に 助詞 9 格助詞 1 * 0 * 0 NIL
よって よって よる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:因る/よる"
@ よって よって よる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:寄る/よる 自他動詞:他:寄せる/よせる"
@ よって よって よる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:縒る/よる 自他動詞:自:縒れる/よれる"
@ よって よって よる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:選る/よる"
、 、 、 特殊 1 読点 2 * 0 * 0 NIL
リモート リモート リモート 名詞 6 普通名詞 1 * 0 * 0 "自動獲得:Wikipedia Wikipedia多義"
ワーク ワーク ワーク 名詞 6 普通名詞 1 * 0 * 0 "自動獲得:Wikipedia Wikipedia多義"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
推奨 すいしょう 推奨 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:推奨/すいしょう カテゴリ:抽象物"
さ さ する 動詞 2 * 0 サ変動詞 16 未然形 3 "代表表記:する/する 自他動詞:自:成る/なる 付属動詞候補(基本)"
れて れて れる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 タ系連用テ形 14 "代表表記:れる/れる"
いる いる いる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 基本形 2 "代表表記:いる/いる"
。 。 。 特殊 1 句点 1 * 0 * 0 NIL
EOS
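The raw command-line output can also be read directly in Python. The sketch below assumes the JUMAN-style format shown above: twelve space-separated fields per line (surface, reading, lemma, part of speech and its ID, and so on, ending with a semantic-information field), lines starting with `@` for alternative analyses of the same surface form, and `EOS` at the end. The function name and the dictionary keys are hypothetical choices of mine.

```python
from typing import Dict, List


def parse_jumanpp_output(text: str) -> List[Dict[str, str]]:
    """Parse JUMAN-style output lines, keeping only the top analysis."""
    morphemes = []
    for line in text.splitlines():
        # Skip blanks, the end marker, and "@" lines (alternative candidates).
        if not line or line == "EOS" or line.startswith("@ "):
            continue
        # Split at most 11 times so the last field keeps its inner spaces.
        fields = line.split(" ", 11)
        if len(fields) < 12:
            continue  # not a morpheme line in the assumed format
        morphemes.append({
            "surface": fields[0],
            "reading": fields[1],
            "lemma": fields[2],
            "pos": fields[3],
            "info": fields[11],
        })
    return morphemes


# Sample lines copied from the output above.
sample = '''に に に 助詞 9 格助詞 1 * 0 * 0 NIL
よって よって よる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:因る/よる"
@ よって よって よる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:寄る/よる 自他動詞:他:寄せる/よせる"
、 、 、 特殊 1 読点 2 * 0 * 0 NIL
EOS
'''

result = [(m["surface"], m["lemma"], m["pos"]) for m in parse_jumanpp_output(sample)]
print(result)  # [('に', 'に', '助詞'), ('よって', 'よる', '動詞'), ('、', '、', '特殊')]
```

Dropping the `@` lines keeps one analysis per surface form, which matches what the pyknp loop in the next section prints.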

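8. JUMAN++ and pyknp

pyknp is a Python package that provides bindings for JUMAN (JUMAN++) and KNP.
For details, see pyknp.
Briefly, JUMAN++ splits text into morphemes, while KNP, which pyknp also wraps, outputs the dependency, case, and anaphoric relations between phrases (bunsetsu) and basic phrases.

It can be installed with pip.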

!pip install pyknp

Out [3]Collecting pyknp Downloading https://files.pythonhosted.org/packages/1d/0e/93221dc85bd214b87b37bdd56af384b252e882fdb91e39c842a2614a8822/pyknp-0.4.5.zip (43kB) |████████████████████████████████| 51kB 4.4MB/s Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from pyknp) (1.15.0) Building wheels for collected packages: pyknp Building wheel for pyknp (setup.py) ... done Created wheel for pyknp: filename=pyknp-0.4.5-cp36-none-any.whl size=40420 sha256=0f8ecd7e9bb598257f7a1c30cd379ec9642df746cf8fc5e31cf9116457658edb Stored in directory: /root/.cache/pip/wheels/7d/0c/46/495789d5ca85293c2478f5bd81e1204f77f949645cb35bf382 Successfully built pyknp Installing collected packages: pyknp Successfully installed pyknp-0.4.5

Now run it.

from pyknp import Juman
jumanpp = Juman()
result = jumanpp.analysis("新型コロナウイルスの感染拡大によって、リモートワークが推奨されている。")
for mrph in result.mrph_list():
    print(mrph.midasi, mrph.yomi, mrph.genkei, mrph.hinsi)

Out[4]

新型 しんがた 新型 名詞
コロナ ころな コロナ 名詞
ウイルス ういるす ウイルス 名詞
の の の 助詞
感染 かんせん 感染 名詞
拡大 かくだい 拡大 名詞
に に に 助詞
よって よって よる 動詞
、 、 、 特殊
リモート リモート リモート 名詞
ワーク ワーク ワーク 名詞
が が が 助詞
推奨 すいしょう 推奨 名詞
さ さ する 動詞
れて れて れる 接尾辞
いる いる いる 接尾辞
。 。 。 特殊

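Personally, I get the impression that this output would be easier to work with, since it is concise and has no placeholders such as *.
That said, while applying something like NEologd seems possible, I could not find any references for it, so it looks hard to do.
Personally, I would like to get that far someday...

That's all on word segmentation!
p.s. Lately the dry air has left my elbows so rough and darkened that they could pass for an elephant's heel.
Not that I have ever seen an elephant's heel...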


[References]
膠着語 (wiki)
屈折語 (wiki)
日本語形態素解析の裏側を覗く!MeCab はどのように形態素解析しているか
github
SentencePieceについて書いてみる。
SentencePieceを使ってみた。
日本語形態素解析システム JUMAN++
ColabでJUMAN++を使う