Updated: 2021-10-26 10:06:40  Source: 動力節點
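HanLP is a Java toolkit made up of a collection of models and algorithms. Its goal is to bring natural language processing into production use. It goes beyond word segmentation and provides a full set of features, including lexical analysis, syntactic parsing, and semantic understanding. HanLP is full-featured, efficient, cleanly structured, built on up-to-date corpora, and customizable.
HanLP is fully open source, including its dictionaries. It does not depend on any other jar. Under the hood it uses a set of high-speed data structures such as the double-array trie, DAWG, and AhoCorasickDoubleArrayTrie, and these basic components are open source as well.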
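Through the utility class HanLP, every function can be called in a single statement. The documentation is detailed and the library works out of the box. The underlying algorithms have been carefully optimized: in the fastest segmentation mode, throughput reaches about 20 million characters per second with only 120 MB of memory. On the I/O side, dictionaries load very quickly, and a fast start takes only 500 ms.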
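To make the role of a dictionary trie more concrete, here is a minimal sketch of dictionary-based segmentation using forward longest matching. It is not HanLP's implementation: it swaps the double-array trie for a plain HashMap trie, and the class name `TrieSegmenterSketch`, its methods, and the tiny dictionary are all made up for illustration. Unspaced lowercase text stands in for Chinese, which has no spaces between words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy dictionary segmenter (hypothetical): a HashMap trie plus forward longest matching. */
final class TrieSegmenterSketch {

    /** One trie node; children are keyed by the next character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a dictionary word
    }

    private final Node root = new Node();

    /** Adds one dictionary word to the trie. */
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** Splits text by always taking the longest dictionary word starting at the current position. */
    List<String> segment(String text) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            Node node = root;
            int longestEnd = start + 1; // fall back to a single character when nothing matches
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no dictionary word continues with this character
                }
                if (node.isWord) {
                    longestEnd = i + 1; // remember the longest match seen so far
                }
            }
            result.add(text.substring(start, longestEnd));
            start = longestEnd;
        }
        return result;
    }

    public static void main(String[] args) {
        TrieSegmenterSketch segmenter = new TrieSegmenterSketch();
        for (String word : new String[]{"good", "goods", "and", "services"}) {
            segmenter.add(word);
        }
        // "goods" wins over the shorter dictionary entry "good".
        System.out.println(segmenter.segment("goodsandservices")); // [goods, and, services]
    }
}
```

Greedy longest matching is fast but cannot resolve every ambiguity, which is one reason a production segmenter combines dictionary lookup with statistical models.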
POM file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.iqilu</groupId>
    <artifactId>Segment</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>Hello</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.3.2</version>
        </dependency>
    </dependencies>
</project>
DemoSegment.java
package com.iqilu;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class DemoSegment {
    public static void main(String[] args) {
        // Sample sentences to segment. HanLP segments Chinese text;
        // the sentences appear here in English translation.
        String[] testCase = new String[]{
                "Goods and services",
                "Married and unmarried are indeed interfering with participles",
                "Buy fruits and then come to the Expo and die at the Expo",
                "China's capital is Beijing",
                "Welcome the new teacher to come to dinner",
                "The virgin officer of the industry and information technology must personally explain the installation of technical devices such as 24 switches through the subordinate departments every month",
                "With the rise of web games, the current web games are prosperous and rely on archives. The design for logical judgment is reduced, but this one cannot be completely ignored.",
        };
        for (String sentence : testCase) {
            // Segment the sentence into terms and print the whole list.
            List<Term> termList = HanLP.segment(sentence);
            System.out.println(termList);
        }
    }
}
Result (each term is printed as word/part-of-speech tag)
[Products/n, and/c, services/vn]
[Married/v, of/uj, and/c, not yet/d, married/v, of/uj, indeed/ad, at/p, interference/v, participle/n, ah/y]
[Buy/v, fruit/n, then/c, come/v, Expo/j, finally/f, go/v, Expo/j]
[China/ns, of/uj, capital/n, yes/v, Beijing/ns]
[Welcome/v, new/a, teacher/n, before death/t, come/v, dinner/v]
[Industry and Information Office/n, female/b, secretary/n, monthly/r, passing/p, subordinate/v, department/n, all/nr, personally/d,
Explain/v, 24/m, port/q, switch/n, etc/u, technical/n, device/n, of/uj, installation/v, work/vn]
[With/p, page/q, youxing/n, from/v, to/v, now/t, of/uj, page tour/nz, flourishing/an, ,/w,
Depend on/v, archive/vn, proceed/v, logic/n, judge/v, of/uj, design/vn, reduce/v, up/ul, ,/w,
But/c, this piece of/r, also/d, cannot/v, completely/ad, ignore/v, drop/v, ./w]
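Each entry in the printed lists has the form word/tag, where the tag is a part-of-speech label such as n (noun), ns (place name), or v (verb). As a rough sketch of how such output could be post-processed, the hypothetical helper below splits printed tokens and keeps the words whose tag starts with a given prefix. The class and method names are assumptions made for this example and are not part of HanLP. When working with the library directly, the word and tag would normally be read from the Term objects rather than parsed from text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for reading "word/tag" tokens as they appear in the printed result. */
final class TaggedTokenSketch {
    final String word;
    final String tag;

    TaggedTokenSketch(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }

    /** Splits at the last slash, so a word that itself contains '/' stays intact. */
    static TaggedTokenSketch parse(String token) {
        int slash = token.lastIndexOf('/');
        if (slash < 0) {
            return new TaggedTokenSketch(token, ""); // no tag present
        }
        return new TaggedTokenSketch(token.substring(0, slash), token.substring(slash + 1));
    }

    /** Returns the words whose tag starts with the given prefix, e.g. "n" matches n and ns. */
    static List<String> wordsWithTagPrefix(List<String> tokens, String prefix) {
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            TaggedTokenSketch parsed = parse(token.trim());
            if (parsed.tag.startsWith(prefix)) {
                words.add(parsed.word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Tokens copied from the fourth result line.
        List<String> tokens = Arrays.asList("China/ns", "of/uj", "capital/n", "yes/v", "Beijing/ns");
        System.out.println(wordsWithTagPrefix(tokens, "n")); // [China, capital, Beijing]
    }
}
```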
A Java word segmentation tool is only one of many Java development tools, and you will come across more related knowledge later.