Wednesday, April 16, 2014

Fox stop words list

While doing some NLP work, I needed a specific English-language stop word list: the one developed in Fox, Christopher. "A stop list for general text." ACM SIGIR Forum. Vol. 24. No. 1-2. ACM, 1989.

Unfortunately, I couldn't find it on the web in a usable form, so I created it from a PDF of the paper. And here it is:

The stop list contains 421 "words" (including all the letters and a couple of other non-words).
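If you want to use the list from Python, here is a minimal sketch of loading it into a set. It assumes the text file has one word per line, which may not match the file exactly; `load_stop_words` and the path you pass in are my own names for illustration, not part of any library.

```python
from pathlib import Path


def load_stop_words(path):
    """Read a stop list stored one word per line and return it as a set.

    Assumption: the file is plain UTF-8 text with one entry per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Lowercase each entry and skip blank lines so lookups are consistent.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}
```

A set makes membership checks fast, and checking `len()` of the result against the 421 entries mentioned above is a quick sanity test that the file loaded cleanly.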

You may also be interested to know that Python's nltk provides a more basic stop list with 127 words:
from nltk.corpus import stopwords
stopwords.words('english')

Finally, there is a project providing more extensive stop word lists in 29 languages. In case Google Code goes down, the latest collection is here.

P.S. Stop words are "insignificant" words that are typically removed before certain types of textual processing.
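To make that concrete, here is a rough sketch of stop word removal. The tokenizer is deliberately naive, the tiny stop list is made up for the example (it is not the Fox list), and `remove_stop_words` is just a name I picked.

```python
import re


def remove_stop_words(text, stop_words):
    """Return the lowercase word tokens of `text` that are not stop words."""
    # A deliberately simple tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


# A tiny illustrative stop list, not the Fox list itself.
sample_stop_words = {"the", "a", "of", "is", "on"}
print(remove_stop_words("The cat is on the roof of a house", sample_stop_words))
# prints ['cat', 'roof', 'house']
```

In practice you would pass in a full stop list instead of the sample set, and probably a better tokenizer too.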

Edit: Updated the links to point to a new host, as the previous host deleted them.

1 comment:

  1. Thank you, for Fox stop words list text file. :)
