中贝叶斯分类中程导弹入昂科拉SS源例子,中文词云代码调节和测试

词云是个很有趣的东西。

华语词云代码调节和测试,中文调节和测试

词云是个很好玩的事物。

用jieba断词,随笔文本存入”mori.txt”,停用词列表在”stopword.txt”中,断词结果好坏,停用词很关键,要求不断调整补充。

from wordcloud import WordCloud
import jieba

f = open(u'mori.txt','r').read()
##cuttext=" ".join(jieba.cut(f))
cuttext= jieba.cut(f) 
final= [] 
stopwords=open(u'stopword.txt','r').read() 

for seg in cuttext:
    ##seg = seg.encode('utf-8')
    if seg[0] not  in ['0','1','2','3','4','5','6','7','8','9']:##忽略数字
        if seg not in stopwords:
            final.append( seg) ## 列表添加   

font=r"c:/Windows/Fonts/simsun.ttc"##中文显示必须加
wordcloud = WordCloud(font_path=font,background_color="white",width=1000, height=860, margin=2).generate(" ".join(final))

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

  wordcloud.to_file(‘test.png’)

效果图:

美高梅开户网址 1

上面是词频总括排序,词长排序的代码。

##统计词频
freqD2 = {}
for word2 in final:
  freqD2[word2] = freqD2.get(word2, 0) + 1 

##按词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: x[1], reverse=True) 
_2000=counter_list[0][1] + 1
print(_2000)##用于词长词频排序用
fp = open('sort.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

##按词长词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: len(x[0])*_2000+x[1], reverse=True) 
fp = open('sortlen.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

排序代码很方便,也值得借鉴,Python是个好东西,壮大,易重用。

 

词云是个很遗闻物。
用jieba断词,小说文本存入”mori.txt”,停用词列表在”stopword.txt”中,断词结果好坏,…

《机器学习实战》中贝叶斯分类中程导弹入猎豹CS六SS源例子,机器学习实战

随着书中代码往下写在此间卡住了,或者还会有别的同学也蒙受了这么的主题素材,记下来分享。

 

跟着书中代码往下写在此间卡住了,考虑到可能还会有别的同学也超过了那般的难点,记下来分享。

用jieba断词,小说文本存入”mori.txt”,停用词列表在”stopword.txt”中,断词结果好坏,停用词很首要,须求不停调节补充。

怎么设置feedparser?

按书中提供的网站间接设置feedparser会提示错误说并未有setuptools,然后去找setuptools,官方的布道是windows最好用ez_setup.py安装,我确实下载不下来官方网站的不得了ez_etup.py,这么些帖子给出了消除方案:

ez_setup.py

将那几个文件直接拷贝到C:\\python2七文书夹中,输入命令行:python
ez_setup.py install

接下来转到放feedparser安装文件的文书夹中,命令行输入:python setup.py
install

 

先捉弄一下,相信超过44%网络朋友在那边卡住的显要缘由是了不起的GFW,所以无论是软件FQ依然肉身FQ的伴儿们估量是无论如何也看不到那篇博文的,不想往下看的请自觉使用FQ本领。

from wordcloud import WordCloud
import jieba

f = open(u'mori.txt','r').read()
##cuttext=" ".join(jieba.cut(f))
cuttext= jieba.cut(f) 
final= [] 
stopwords=open(u'stopword.txt','r').read() 

for seg in cuttext:
    ##seg = seg.encode('utf-8')
    if seg[0] not  in ['0','1','2','3','4','5','6','7','8','9']:##忽略数字
        if seg not in stopwords:
            final.append( seg) ## 列表添加   

font=r"c:/Windows/Fonts/simsun.ttc"##中文显示必须加
wordcloud = WordCloud(font_path=font,background_color="white",width=1000, height=860, margin=2).generate(" ".join(final))

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

小编提供的SportageSS源链接“

书中小编的意趣是的话自源
中的小说作为分类为一的稿子,以来自源
中的小说作为分类为0的稿子

为了能够跑通示例代码,可以找两可用的陆风X8SS源作为取代。

自己用的是那五个源:

NASA Image of the
Day:

Yahoo Sports – NBA – Houston Rockets
News:

也正是说,假使算法运维正确的话,全体来自于 nasa
的稿子将会被分类为一,全数来自于yahoo sports的休斯顿火箭队(休斯敦 罗克ets)音信将会分类为0

 

 

  wordcloud.to_file(‘test.png’)

动用自身定义的大切诺基SS源,当程序运行到trainNB0(array(trainMat),array(trainClasses))时会报错,如何是好?

从书中小编的事例来看,作者利用的源中文章数量较多,len(ny[‘entries’])
为 十0,我找的多少个 冠道SS 源唯有10-十多个左右。

>>> import feedparser
>>>ny=feedparser.parse(”)
>>> ny[‘entries’]
>>> len(ny[‘entries’])
100

因为小编的3个CR-VSS源有⑩0篇文章,所以他得以在代码中剔除了二十四个“停用词”,随机挑选20篇小说作为测试集,可是当大家接纳代替卡宴SS源时大家只有10篇小说却要收取20篇文章作为测试集,那样鲜明是会出错的。只要自身调控下测试集的数据就能够让代码跑通;借使文章中的词太少,收缩剔除的“停用词”数量能够加强算法的准确度。

 

怎么设置feedparser?

按书中提供的网站直接设置feedparser会提醒错误说并未有setuptools,然后去找setuptools,官方的说法是windows最佳用ez_setup.py安装,小编的确下载不下去官方网站的百般ez_etup.py,这一个帖子给出了消除方案:

ez_setup.py

将这些文件一贯拷贝到C:\\python27文件夹中,输入命令行:python
ez_setup.py install

然后转到放feedparser安装文件的文本夹中,命令行输入:python setup.py
install

 

效果图:

假若不想将现出频率排序最高的二1五个单词移除,该怎么去掉“停用词”?

能够把要去掉的停用词存放到txt文件中,使用时读取(代替移除高频词的代码)。具体需求停用哪些词能够参照那里

以下代码想健康运维供给将停用词保存至stopword.txt中。

本人的txt中保存了以下单词,效果还不易:

a
about
above
after
again
against
all
am
an
and
any
are
aren’t
as
at
be
because
been
before
being
below
between
both
but
by
can’t
cannot
could
couldn’t
did
didn’t
do
does
doesn’t
doing
don’t
down
during
each
few
for
from
further
had
hadn’t
has
hasn’t
have
haven’t
having
he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
i
i’d
i’ll
i’m
i’ve
if
in
into
is
isn’t
it
it’s
its
itself
let’s
中贝叶斯分类中程导弹入昂科拉SS源例子,中文词云代码调节和测试。me
more
most
mustn’t
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan’t
she
she’d
she’ll
she’s
should
shouldn’t
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
美高梅开户网址 ,was
wasn’t
we
we’d
we’ll
we’re
we’ve
were
weren’t
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
won’t
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves

 

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def stopWords():
    import re
    wordList =  open('stopword.txt').read() # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    return [tok.lower() for tok in listOfTokens] 
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return listOfTokens

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses

    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList

    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat

    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb

小编提供的BMWX三SS源链接“

书中小编的意趣是的话自源
中的小说作为分类为一的小说,以来自源
中的小说作为分类为0的稿子

为了能够跑通示例代码,能够找两可用的福特ExplorerSS源作为代表。

本人用的是那三个源:

NASA Image of the
Day:

Yahoo Sports – NBA – Houston Rockets
News:

也正是说,若是算法运转正确的话,全体来自于 nasa
的文章将会被分类为壹,全部来自于yahoo sports的休斯顿休斯敦火箭新闻将会分类为0

 

美高梅开户网址 2

利用自个儿定义的奥迪Q7SS源,当程序运转到trainNB0(array(trainMat),array(trainClasses))时会报错,如何是好?

从书中作者的事例来看,笔者利用的源汉语章数量较多,len(ny[‘entries’])
为 十0,我找的多少个 奥迪Q三SS 源唯有十-十几个左右。

>>> import feedparser
>>>ny=feedparser.parse(”)
>>> ny[‘entries’]
>>> len(ny[‘entries’])
100

因为小编的3个奥迪Q5SS源有十0篇小说,所以他得以在代码中剔除了二6个“停用词”,随机选用20篇小说作为测试集,可是当我们接纳代替哈弗SS源时大家唯有十篇小说却要抽取20篇小说作为测试集,那样显明是会出错的。只要自身调控下测试集的数据就可以让代码跑通;借使文章中的词太少,减弱剔除的“停用词”数量能够巩固算法的准确度。

 

上边是词频计算排序,词长排序的代码。

跟着书中代码往下写在此地卡住了,恐怕还会有其余同学也遭受了那样的问…

比方不想将应运而生频率排序最高的二105个单词移除,该怎么去掉“停用词”?

能够把要去掉的停用词存放到txt文件中,使用时读取(替代移除高频词的代码)。具体需求停用哪些词能够参考那里

以下代码想健康运作须要将停用词保存至stopword.txt中。

自笔者的txt中保存了以下单词,效果还不易:

a
about
above
after
again
against
all
am
an
and
any
are
aren’t
as
at
be
because
been
before
being
below
between
both
but
by
can’t
cannot
could
couldn’t
did
didn’t
do
does
doesn’t
doing
don’t
down
during
each
few
for
from
further
had
hadn’t
has
hasn’t
have
haven’t
having
he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
i
i’d
i’ll
i’m
i’ve
if
in
into
is
isn’t
it
it’s
its
itself
let’s
me
more
most
mustn’t
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan’t
she
she’d
she’ll
she’s
should
shouldn’t
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
was
wasn’t
we
we’d
we’ll
we’re
we’ve
were
weren’t
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
won’t
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves

 

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def stopWords():
    import re
    wordList =  open('stopword.txt').read() # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    return [tok.lower() for tok in listOfTokens] 
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return listOfTokens

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses

    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList

    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat

    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
##统计词频
freqD2 = {}
for word2 in final:
  freqD2[word2] = freqD2.get(word2, 0) + 1 

##按词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: x[1], reverse=True) 
_2000=counter_list[0][1] + 1
print(_2000)##用于词长词频排序用
fp = open('sort.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

##按词长词频排序输出
counter_list = sorted(freqD2.items(), key=lambda x: len(x[0])*_2000+x[1], reverse=True) 
fp = open('sortlen.txt',"w+",encoding='utf-8')
for d in counter_list:
  fp.write(d[0]+':'+str(d[1]))
  fp.write('\n')
fp.close()

排序代码很有益于,也值得借鉴,Python是个好东西,强大,易重用。

 

发表评论

电子邮件地址不会被公开。 必填项已用*标注

网站地图xml地图