Thursday, April 17, 2014

ICPSR Congressional Member IDs in Python

Keith Poole and Howard Rosenthal have put together the authoritative list of Congressmen's member IDs, which they make available here at voteview.com.

To make it a bit easier to use from Python, I wrote a simple script that reads in the files and maps party and state IDs to names. I provide it below in case it proves useful to others!


Note that the script includes Python-usable mappings from ICPSR state IDs to state abbreviations and state names, and from political party IDs to party names, which may be useful in their own right.
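To give a rough idea of how mappings like these can be used, here is a minimal sketch. It is not the script itself: the dictionary names and the describe_member helper are hypothetical, and the tables are a tiny illustrative excerpt of the ICPSR codes rather than the full lists.

```python
# Illustrative excerpt only; the real tables cover every state and party.
# Dictionary and function names here are hypothetical.
ICPSR_STATE_ABBREV = {13: "NY", 49: "TX", 71: "CA"}
ICPSR_PARTY_NAME = {100: "Democrat", 200: "Republican"}


def describe_member(state_id, party_id):
    """Turn numeric ICPSR state and party IDs into a readable label."""
    state = ICPSR_STATE_ABBREV.get(state_id, "unknown state")
    party = ICPSR_PARTY_NAME.get(party_id, "unknown party")
    return f"{party} ({state})"


print(describe_member(71, 100))
```

Using dict.get with a fallback keeps the lookup from raising a KeyError on IDs that are missing from the excerpt.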

Edit: Updated link to a new host.

Wednesday, April 16, 2014

Fox stop words list

In the process of doing some NLP work, I needed a specific English-language stop-word list: the one developed in Fox, Christopher. "A stop list for general text." ACM SIGIR Forum. Vol. 24. No. 1-2. ACM, 1989.

Unfortunately, I couldn't find it on the web in a usable form, so I created it from a PDF of the paper. Here it is:

This is a stop list including 421 "words" (including all the letters and a couple other non-words).

You may also be interested to know that Python's nltk provides a more basic stop list with 127 words:
from nltk.corpus import stopwords
stopwords.words('english')

Finally, there is a project providing more extensive stop-word lists in 29 languages. In case Google Code goes down, the latest collection is here.

P.S. Stop words are "insignificant" words that are typically removed before certain types of textual processing.
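For anyone new to the idea, here is a minimal sketch of what that removal step can look like. The SAMPLE_STOP_WORDS set and the remove_stop_words helper are hypothetical names made up for this illustration; the tiny set stands in for a real list such as the Fox list above, and the tokenizer is deliberately crude.

```python
import re

# Tiny stand-in stop list for illustration; a real list is much longer.
SAMPLE_STOP_WORDS = {"a", "the", "of", "and", "is", "in", "to"}


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("The cat is in the hat"))
```

Keeping the stop list in a set makes each membership check cheap, which matters once the list grows to a few hundred words.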

Edit: Updated the links to point to a new host as the last host deleted them.