python - Document classification in Spark MLlib


I want to classify documents by whether they belong to sports, entertainment, or politics. I have created a bag of words, which outputs something like this:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something Naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that MLlib's Naive Bayes can use?

For text classification, you need to:

  • build a word dictionary
  • convert each document into a vector using the dictionary
  • label the document vectors:

    doc_vec1 -> label1

    doc_vec2 -> label2

    ...

This sample is pretty straightforward.
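
As a rough illustration of the three steps listed above (this is not the sample the answer refers to), here is a minimal sketch in Python. It assumes each document is already a list of (count, word) pairs like the output shown in the question, and it relies on the RDD-based pyspark.mllib API. The names build_dictionary, to_sparse_counts and train_naive_bayes, the toy documents, and the label encoding (0.0 sports, 1.0 entertainment, 2.0 politics) are all hypothetical.

```python
def build_dictionary(docs):
    """Step 1: map every distinct word to a column index."""
    vocabulary = sorted({word for doc in docs for _, word in doc})
    return {word: index for index, word in enumerate(vocabulary)}


def to_sparse_counts(count_word_pairs, dictionary):
    """Step 2: turn (count, word) pairs into sorted (index, count) pairs."""
    totals = {}
    for count, word in count_word_pairs:
        if word in dictionary:  # words outside the dictionary are ignored
            index = dictionary[word]
            totals[index] = totals.get(index, 0.0) + float(count)
    return sorted(totals.items())


def train_naive_bayes(sc, labeled_docs, smoothing=1.0):
    """Step 3: label the vectors and train; needs a live SparkContext `sc`."""
    # Imported here so the two helpers above work without a Spark installation.
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    dictionary = build_dictionary(doc for _, doc in labeled_docs)
    points = [
        LabeledPoint(label, Vectors.sparse(len(dictionary),
                                           to_sparse_counts(doc, dictionary)))
        for label, doc in labeled_docs
    ]
    model = NaiveBayes.train(sc.parallelize(points), smoothing)
    return model, dictionary


# Toy data (made up for illustration): (label, bag of words) per document.
docs = [
    (0.0, [(2, "cricket"), (1, "saurashtra")]),  # assumed: 0.0 = sports
    (1.0, [(1, "film"), (1, "actor")]),          # assumed: 1.0 = entertainment
    (2.0, [(1, "election"), (1, "minister")]),   # assumed: 2.0 = politics
]
dictionary = build_dictionary(doc for _, doc in docs)
first_vector = to_sparse_counts(docs[0][1], dictionary)
```

With a running SparkContext sc, calling train_naive_bayes(sc, docs) would return the trained model together with the dictionary. A new document can then be vectorized with the same dictionary, wrapped in Vectors.sparse, and passed to model.predict. A real corpus would be vectorized inside an RDD transformation rather than in a driver-side list; the list here only keeps the sketch short.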

