When working with text documents, it is often useful to filter out words that occur frequently in all documents (e.g. 'the', 'is', ...). These words, called stop words, give no special hint about a document's content. The nltk (Natural Language Toolkit) library for Python includes a list of stop words for several languages. For example:
from nltk.corpus import stopwords
stop_word_list = stopwords.words('english')
# to quickly test if a word is not a stop word, use a set:
stop_word_set = set(stop_word_list)
document = 'The data is strong in this one'
for word in document.split():
    # lowercase first, because the stop word list is all lowercase
    if word.lower() not in stop_word_set:
        print(word)
# outputs: data, strong, one
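One thing to watch for: document.split() leaves punctuation attached to words, so 'one.' would not match an entry 'one'. Here is a minimal sketch of one way around that, using a regular expression to pull out the words. The names remove_stop_words and SAMPLE_STOP_WORDS are made up for this example, and the small hand-written set is only a stand-in for stopwords.words('english') so that the snippet runs without any downloaded data.

```python
import re

# Small stand-in for stopwords.words('english'); an assumption for illustration only.
SAMPLE_STOP_WORDS = {'the', 'is', 'in', 'this', 'a', 'an', 'and', 'of'}

def remove_stop_words(text, stop_words):
    """Return the words of `text`, in order, that are not in `stop_words`."""
    # Extract runs of letters and apostrophes, which drops surrounding punctuation.
    words = re.findall(r"[A-Za-z']+", text)
    # Compare in lowercase, but keep the original spelling in the result.
    return [w for w in words if w.lower() not in stop_words]

print(remove_stop_words('The data is strong in this one.', SAMPLE_STOP_WORDS))
# prints: ['data', 'strong', 'one']
```

In practice you would pass set(stopwords.words('english')) as the second argument instead of the sample set.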
You will probably first have to download the stop words using nltk's download() function. The following code should give you a GUI window to select the data you want (look for stopwords under the "Corpora" tab):
import nltk
nltk.download()
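If you are on a machine without a display, or just want to skip the GUI, nltk.download() also accepts a package name. The sketch below wraps that in a hypothetical helper, ensure_stopwords (my name, not part of nltk), which only downloads when the corpus cannot be found locally. It assumes nltk is installed and that the machine has network access the first time it runs.

```python
def ensure_stopwords():
    """Download the nltk stop word corpus only if it is not already installed."""
    import nltk  # imported here so that defining the helper does not require nltk
    try:
        # Raises LookupError when the corpus is missing from every nltk data path.
        nltk.data.find('corpora/stopwords')
    except LookupError:
        # Passing a package name skips the GUI and fetches just that package.
        nltk.download('stopwords')
```

Calling ensure_stopwords() once before stopwords.words('english') should be enough; later runs find the local copy and do not download again.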