Email Data Mining Script in Python
So I’m a little bored with my summer internship at Mississippi State University. It’s not a good idea to let people with creative minds like mine get bored, because we start to do “interesting” things. Anyway, I was in charge of visiting a whole bunch of websites and identifying their webmasters. I didn’t want to visit all 100 or so sites myself, and some of them did not list any email addresses at all. To help with the process of finding webmasters, I created an email data mining script in Python, which you can download here. I decided to make it available under the GNU General Public License 3.0.
A couple of things to note about it: 1) It only runs in a Unix environment, because it relies on the wget and host commands. 2) While downloading content with wget, I occasionally came across websites with massive amounts of media on them that crashed the script. Although I have tailored the wget command that this Python script calls so that it does not download movie or picture files, I still saw Python choke and die with out-of-memory errors. I also noticed that some websites serve media files from odd tokenized URLs that do not end in a file type, so those files are still downloaded even though they are media files.
I don’t plan to maintain this script at all, which is why I am releasing it only on my website and not on SourceForge. The only other email mining tools I currently know of are The Harvester, an open source email mining script that just searches Google, Bing, or other search engines, and Maltego, which you have to pay for in order to download the emails you find. If you know of others and want to call me out on them, leave them in the comments.
My email mining script simply downloads an entire website to disk as a single file, then uses a regular expression based on the RFC 2822 standard, as described at http://www.regular-expressions.info/email.html, to find email addresses and print them to the screen. You are welcome to use it, but it is slow and I make no guarantees about its accuracy or performance.
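The scanning step can be sketched roughly as follows. This is a minimal illustration, not the code in emailMiner.py: the pattern is the commonly quoted practical RFC 2822-style expression (no quoted local parts or bracketed IP literals), and the names EMAIL_RE, find_emails, and find_emails_in_file are made up for this example. The real script's pattern may differ in detail.

```python
import re

# Practical RFC 2822-style pattern. Assumption: this approximates, but is not
# necessarily identical to, the expression used by the actual script.
EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)


def find_emails(lines):
    """Return unique, lowercased addresses in order of first appearance."""
    seen = set()
    found = []
    for line in lines:
        for match in EMAIL_RE.finditer(line):
            address = match.group(0).lower()
            if address not in seen:
                seen.add(address)
                found.append(address)
    return found


def find_emails_in_file(path):
    """Scan a downloaded file line by line instead of loading it whole."""
    with open(path, "r", errors="replace") as handle:
        return find_emails(handle)
```

For example, find_emails(["Contact webmaster@example.edu or Bob.Smith@cs.example.edu."]) returns both addresses in lowercase. Iterating over the file line by line keeps memory use tied to the longest line rather than the size of the whole download, which may help with very large sites, although a single enormous line would still be read in one piece.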
To use the script emailMiner.py, create a text file listing the IP addresses of the websites you want to visit, one per line, then run the command “python emailMiner.py websites.txt”. You will probably want to modify the script so that it writes to a file location of your choice, because everything is hard-coded.
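If you want to adapt the download step, a rough sketch of the surrounding plumbing might look like the code below. Everything here is an assumption for illustration: the function names, the suffix list, the recursion depth, and the idea that host is used for a reverse lookup of each IP address are mine, not taken from emailMiner.py. The wget options themselves (--recursive, --level, --reject, --output-document) are standard, and combining --recursive with --output-document makes wget concatenate everything it fetches into one file.

```python
import os
import subprocess

# Illustrative list of suffixes to skip; not the script's actual list.
MEDIA_SUFFIXES = ["jpg", "jpeg", "png", "gif", "mp3", "mp4", "avi", "mov", "wmv"]


def read_targets(path):
    """Return the non-empty lines of the targets file, one IP address each."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def parse_host_output(output):
    """Pull the hostname out of `host <ip>` output, or None if no PTR record."""
    marker = "domain name pointer "
    for line in output.splitlines():
        if marker in line:
            return line.split(marker, 1)[1].strip().rstrip(".")
    return None


def reverse_lookup(ip):
    """Run the Unix `host` command and parse its answer (assumed usage)."""
    result = subprocess.run(["host", ip], capture_output=True, text=True)
    return parse_host_output(result.stdout)


def build_wget_command(site, out_file):
    """Build a recursive wget call that writes all pages into out_file."""
    return [
        "wget", "--quiet", "--recursive", "--level=2",
        "--reject=" + ",".join(MEDIA_SUFFIXES),
        "--output-document=" + out_file,
        "http://" + site + "/",
    ]


def download_all(targets_path, out_dir):
    """Download each target into out_dir and yield (ip, downloaded file)."""
    for ip in read_targets(targets_path):
        site = reverse_lookup(ip) or ip  # fall back to the bare IP address
        out_file = os.path.join(out_dir, ip + ".html")
        subprocess.run(build_wget_command(site, out_file))
        yield ip, out_file
```

Passing the output directory as a parameter is one way around the hard-coded paths. Note that --reject only filters by file suffix, so media served from tokenized URLs without an extension would still get through, as described above.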
Oh yeah, and one more thing… Nigerians, you are not allowed to download this email mining tool. I am just kidding (I’m referring to the 419 email scammers, of course). But seriously, this tool is meant for people who need to mine email data for legitimate reasons such as penetration testing, auditing, etc. Please do not abuse it.