I set out a few weeks ago as a person with a grand and beautiful mission: to scrape the National UFO Reporting Center Database. As we are all aware, The Truth is Out There, and as a statistician trying to learn Python for data analysis and data science, I wanted to find The Truth through data.
Not a true beginner to Python, I had some ideas about how I might go about scraping the database, and data mining from the web is one of my overarching goals for my time in the ChiPy mentorship program. So, fresh and as green as an alien might look (who's to say), I started learning Scrapy. I followed its tutorial, which included learning how to use a virtual environment on my computer.
How to set up a virtual environment: I learned this by Googling it, but now I'm enough of a pro to satisfy my deeply ingrained perfectionism. I had used virtual environments inside Cloud9, an interactive programming tool I've used while taking rmotr.com classes, but had not done it on my own computer yet.
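For anyone else learning this, here's a quick sketch of what it boils down to. It uses the standard-library venv module, and the environment name is just an example; the more common route is running the equivalent commands in a terminal, as noted in the comments.

```python
# A quick sketch of creating a virtual environment with the standard-library
# venv module (Python 3). The usual way is from the command line, roughly:
#   python3 -m venv scrapy-env
#   source scrapy-env/bin/activate
#   pip install scrapy   # installs only into that environment
import venv

venv.create('scrapy-env', with_pip=True)  # 'scrapy-env' is just an example name
```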
The need for more precise language: On a call with Arpit, my mentor, I made the mistake of saying, "Oh, I used a virtual machine for Scrapy," at which point he was like, "…I truly doubt it." (I may have taken some liberties in representing what he said, but bear with me.) He then proceeded to explain the difference between a virtual machine and a virtual environment, which I will not try to reproduce verbatim here, lest I besmirch his good name further. As a statistician, I understand that accuracy in language is crucial. It grinds my gears every time someone uses correlation when they mean association, and don't get me started on multivariate versus multivariable. Being more precise in my language will help me communicate with other coders in the future.
If you hadn't guessed by clicking the link and looking at the NUFORC database, Scrapy wasn't going to do it on the scraping front. I discovered very quickly that the database is about the closest thing to plain text that HTML can be. If I had read the Scrapy tutorial before I started, I would have realized the next thing on my list.
I was first introduced to coding when I took a two-semester Intro to Java course in 2012. From there, I learned SAS and R while getting my Master's degree. I am all too aware that I take for granted being able to read and understand Python (and, sometimes, the other languages I know), even if I can't write it off the top of my head.
# Some things I learned on purpose
# return 'PeopleQuerySet: {0} objects'.format(len(self.objects))
I know the code probably isn't that exciting to see out of context, and I had to comment it out so R didn't have a cow, but I remembered to use format statements and I'm proud of myself. Prior to being selected for ChiPy, I was fortunate to receive a scholarship through Women Who Code to take rmotr.com's Intro to Python course, and rmotr generously supported me in pursuing the Advanced course as well, where I helped build a decorators library, a clone of Twitter complete with an API, a tic-tac-toe game, and more.
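For a bit more context, here's a hypothetical, runnable reconstruction of the kind of class that commented-out line came from; the class name and the objects attribute are just illustrative, not the actual project code.

```python
# A hypothetical reconstruction of where a line like that might live.
# 'PeopleQuerySet' and 'self.objects' are illustrative names, not the real code.
class PeopleQuerySet:
    def __init__(self, objects):
        self.objects = list(objects)

    def __repr__(self):
        # The format statement I was so pleased to remember
        return 'PeopleQuerySet: {0} objects'.format(len(self.objects))


print(PeopleQuerySet(['Mulder', 'Scully']))  # PeopleQuerySet: 2 objects
```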
We utilized TensorFlow (TF) through a Docker container, which was an excellent first exposure to Docker, another tool that has required more of my newfound command-line knowledge. That knowledge has come in handy as my next purposeful learning opportunity has come along.
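For the curious, the kind of thing you can run once a container like that is up looks something like this. It's a TF 1.x-style hello world (what was current at the time), and the docker command in the comment is the standard way to start the official image, not a record of our exact setup.

```python
# A minimal TensorFlow 1.x-style "hello world", the sort of thing you can run
# once the container is up. The official image is typically started with
#   docker run -it -p 8888:8888 tensorflow/tensorflow
# which serves a Jupyter notebook on port 8888. (Our exact setup may have differed.)
import tensorflow as tf

hello = tf.constant('Hello from inside the container')
with tf.Session() as sess:
    print(sess.run(hello))
```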
# Regarding our Actual ChiPy Mentorship Project

Even before I set out to scrape the NUFORC, I knew the dataset was already on Kaggle and was additionally available in this GitHub repository. I still wanted to try my hand at scraping, and Arpit is working on an actual solution using BeautifulSoup right now.
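To give a flavor of what that looks like (this is not Arpit's actual code, and the index URL is my assumption about where the report listings live), a bare-bones requests + BeautifulSoup pass might go something like this:

```python
# A bare-bones sketch of pulling the NUFORC report index with requests and
# BeautifulSoup. NOT Arpit's actual solution; the URL is an assumption about
# where the event index lives, so check the site before relying on it.
import requests
from bs4 import BeautifulSoup

INDEX_URL = 'http://www.nuforc.org/webreports/ndxevent.html'  # assumed location

response = requests.get(INDEX_URL)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')

# The index is close to plain text: mostly a table of links, one per reporting period.
report_links = [a['href'] for a in soup.find_all('a', href=True)]
print(report_links[:5])
```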
I'm currently working on descriptive data analysis using NumPy and pandas (a tiny sketch of the kind of exploration I mean appears after the list below). We plan to regroup this weekend to do some scoping and decide how big our project can really be in the next few months. We'd love to do some prediction using TensorFlow and other machine learning methods. The dataset is large, about 80,000 sightings, and the version on Kaggle is also geocoded. My specific goals over the course of the ChiPy Mentorship Program include:
- Become proficient in NumPy, pandas, seaborn, and matplotlib
- Start to get familiar with scikit-learn's machine learning methods (excellent, because I'll basically be doing the same methods in tandem in R for my side gig)
- Continue learning how to use TensorFlow to explore my interest in deep learning, either with the NUFORC data or complementary data
- Actually scrape a relevant website to collect useful information
- Create a cool portfolio project that showcases both my strong statistical base and my newfound Python skills, care of my mentor Arpit and the other participants in the spring 2017 ChiPy Mentorship Program.
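As promised above, here's a minimal sketch of the kind of first descriptive pass I'm working on. The file name and column names are assumptions about the Kaggle export, so adjust them to whatever the actual CSV uses.

```python
# A minimal first descriptive pass with pandas. The file name ('scrubbed.csv')
# and the column names ('state', 'shape') are assumptions about the Kaggle
# export; adjust them to match the actual CSV.
import pandas as pd

sightings = pd.read_csv('scrubbed.csv', low_memory=False)

print(sightings.shape)        # roughly (80000, n_columns) if the ~80k figure holds
print(sightings.describe())   # numeric summaries

# Which states and sighting shapes show up most often?
print(sightings['state'].value_counts().head(10))
print(sightings['shape'].value_counts().head(10))
```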