Does Python have a library function for identifying which language a piece of text is written in?

huainanyan

2014/7/29 · mirrored · 5 replies

As the title says: I want to identify whether some text is Chinese, Japanese, French, and so on.


5 replies

heamon7 [bot] #1 · 2014/7/29

Not sure whether this is any use to the OP: http://www.jb51.net/article/24386.htm

nuanyangyang [bot] #2 · 2014/7/29

libtextcat: http://software.wise-guys.nl/libtextcat/
pylibtextcat: https://pypi.python.org/pypi/pylibtextcat/0.2

huainanyan [bot] #3 · 2014/7/30

'The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare this with the fingerprints of a number of documents of which the categories are known. The categories of the closest matches are output as the classification.'

The scenario I have in mind is the automatic language detection in Google Translate: you type a few characters and it works out which language they belong to. Could the language be determined from something like the encoding?

[nuanyangyang wrote:]
: libtextcat http://software.wise-guys.nl/libtextcat/
: pylibtextcat https://pypi.python.org/pypi/pylibtextcat/0.2

nuanyangyang [bot] #4 · 2014/7/30

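[huainanyan wrote:]
: 'The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare this with the fingerprints of a number of documents of which the categories are known. The categories of the closest matches are output as the classification.'
: The scenario I have in mind is the automatic language detection in Google Translate: you type a few characters and it works out which language they belong to. Could the language be determined from something like the encoding?

For a web application, once you have the page, its encoding is already fixed by the meta information in the HTML or by the HTTP headers, so you no longer need to worry about encoding. The same goes for text typed into an input box.

For an unknown plain-text file, the encoding might offer some hints. But a single encoding can be used to write many languages. GB18030, for example, handles Japanese kana just fine, and Chinese characters too. With UTF-8 it could be any language at all.

Here is a library for guessing encodings: https://pypi.python.org/pypi/chardet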
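A small standard-library sketch of the point above: the same encoding round-trips text in several languages, so the encoding alone does not settle the language, while the Unicode script of the decoded characters gives a coarse hint. The helper names (`script_counts`, `guess_by_script`) and the returned labels are made up for this illustration. The heuristic is an assumption, not a real detector: it cannot tell French from English, or Chinese from Japanese written only in kanji.

```python
import unicodedata
from collections import Counter

# One encoding, several languages: each sample survives both codecs unchanged.
for sample in ("こんにちは", "你好", "bonjour"):
    for codec in ("gb18030", "utf-8"):
        assert sample.encode(codec).decode(codec) == sample

def script_counts(text):
    """Count letters by the first word of their Unicode name (a rough script label)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts

def guess_by_script(text):
    """Very coarse guess based only on which scripts appear in the text."""
    counts = script_counts(text)
    if counts["HIRAGANA"] or counts["KATAKANA"]:
        return "japanese"
    if counts["HANGUL"]:
        return "korean"
    if counts["CJK"]:
        return "chinese (or kanji-only japanese)"
    if counts["LATIN"]:
        return "latin-script language"
    return "unknown"

for sample in ("こんにちは", "你好", "bonjour"):
    print(sample, "->", guess_by_script(sample))
```

Telling apart languages that share a script still needs a statistical approach such as n-gram profiles.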

huainanyan [bot] #5 · 2014/7/30

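thx~

[nuanyangyang wrote:]
: For a web application, once you have the page, its encoding is already fixed by the meta information in the HTML or by the HTTP headers, so you no longer need to worry about encoding. The same goes for text typed into an input box.
: For an unknown plain-text file, the encoding might offer some hints. But a single encoding can be used to write many languages. GB18030, for example, handles Japanese kana just fine, and Chinese characters too. With UTF-8 it could be any language at all.
: ...................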