If you have ever wondered which language (or technology) to use for your day-to-day data analysis, I can tell you from the get-go that I am not going to answer that question for you in this article. I will definitely recommend Python, but that is because I have worked with Python extensively, so I am very biased. What I can do in this article, however, is introduce you to the most used Python libraries for data science and show some of their basic functionality.
Python is a general-purpose, feature-rich, efficient, and easy-to-learn language, and it is also well suited to working with and understanding data: big data, small data, structured or unstructured data, basically any data.
I will not go into the particularities of data science, but I will mention some of the essential tick boxes for a data analysis framework/project:
- Collect data from different sources
- Process and clean data in an efficient manner
- Extract relevant information from data (either by exploratory data analysis or through modeling and algorithms)
- Visualize and communicate the results in an appealing and easy to understand format
- Some magic, to make sure that everything comes together beautifully and on time
With that said, if you choose Python for your project, you should be able to cover all of the expectations enumerated above, since it has a vast number of libraries that solve a variety of data problems.
This is by no means an exhaustive list, and I am by no means a Python expert, just a Python enthusiast with a background in data analysis. I tried to put together a list of some of the commonly used packages for data science or, in other words, the packages that get the job done, and then some. There are others, of course, but they do roughly the same things as the ones on my list, give or take a few features, and they are not as popular.
As I said above, I will also illustrate some of the basic capabilities and provide an example for each library, with a few lines of code, to give a general idea of where to begin with these libraries and what you can do with them.
I would recommend installing the Anaconda distribution because it contains most of the libraries I will mention, plus many more (roughly 1000, depending on the Python version and OS). Alternatively, you can install Miniconda and then use the conda package manager to add any package you are interested in, with the following command:
$ conda install PACKAGENAME
Otherwise, if you already have Python installed, use the pip installer:
$ pip install PACKAGENAME
Jupyter Notebook is the first library that I would suggest to any newcomer to the data science field. This library, part of the Jupyter project, offers an interactive tool for writing Python code and text in your browser, and it also displays the output nicely, right under your code.
I will use Jupyter Notebook for the code snippets in this article, so here are a few basics first.
To start, just type the following in the console:
$ jupyter notebook
After the application opens up in the browser, at http://localhost:8888, you can create a new file, or upload one, from the upper right side.
Upon opening a new notebook, you can write code or text in the cells, run it and the result appears under the cell. Below is just the regular “Hello world!” example:
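As a sketch of what that cell could contain (the variable name message is my own choice, not something the notebook requires), type the following and run it with Shift+Enter:

```python
# A single notebook cell: the printed text shows up right under the cell
message = "Hello world!"
print(message)
```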
You can find an extensive tutorial here.
Please keep in mind that closing the browser will not close the notebook app. You have to close the associated terminal.
NumPy is the fundamental package for scientific computing in Python. It provides an abundance of useful features for operations on n-dimensional arrays and matrices, as well as routines that allow developers to perform advanced mathematical and statistical functions on those arrays with as little code as possible.
While NumPy does not provide many data science functionalities by itself, understanding array-oriented computing will help you use other Python data analysis tools more effectively.
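To give a rough idea of what array-oriented computing means, here is a small sketch. The prices and quantities are made-up sample values, used only to show that arithmetic, aggregation, and filtering apply to whole arrays without an explicit loop:

```python
import numpy as np

# Hypothetical sample data, invented for this illustration
prices = np.array([10.0, 20.0, 30.0, 40.0])
quantities = np.array([1, 2, 3, 4])

# Element-wise arithmetic on whole arrays, no explicit Python loop
revenue = prices * quantities

# Aggregations and boolean masks also work on the entire array at once
total = revenue.sum()
large_orders = revenue[revenue > 50]

print(revenue, total, large_orders)
```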
In the following example, I create a NumPy array and then plot it as a checkerboard using the Matplotlib library.
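One possible way to write this is sketched below. The 8x8 size, the slicing approach, and the gray colormap are my assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Build an 8x8 array of zeros, then set alternating cells to 1
board = np.zeros((8, 8), dtype=int)
board[1::2, ::2] = 1   # odd rows, even columns
board[::2, 1::2] = 1   # even rows, odd columns
print(board)

# Draw each array cell as one square of the checkerboard
fig, ax = plt.subplots()
ax.imshow(board, cmap="gray", interpolation="nearest")
ax.set_title("Checkerboard")
# In a notebook the figure renders below the cell; in a script, call plt.show()
```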
Matplotlib allows you to quickly make line graphs, pie charts, histograms, and other figures. It offers powerful visualizations and is an excellent competitor to MATLAB and Mathematica. There are also facilities for creating labels, grids, legends, and many other formatting entities; basically, everything is customizable. It supports different GUI backends on all operating systems, and can also export graphics to common vector and raster formats like PDF, SVG, JPG, PNG, BMP, GIF, etc.
Below is a simple plot of sine and cosine.
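A minimal version could look like this. The plotted range and the number of sample points are arbitrary choices of mine:

```python
import numpy as np
import matplotlib.pyplot as plt

# 256 evenly spaced points between -pi and pi
x = np.linspace(-np.pi, np.pi, 256)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.legend()
ax.grid(True)
# In a notebook the figure renders below the cell; in a script, call plt.show()
```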
Given the checkerboard example from the NumPy section, let’s play with the interpolation a little:
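The sketch below rebuilds the same kind of checkerboard array so that it runs on its own, then draws it with three of the interpolation methods that imshow() accepts. Which methods to compare is my choice:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same 8x8 checkerboard as before, rebuilt so this cell is self-contained
board = np.zeros((8, 8), dtype=int)
board[1::2, ::2] = 1
board[::2, 1::2] = 1

# One subplot per interpolation method
methods = ["nearest", "bilinear", "bicubic"]
fig, axes = plt.subplots(1, len(methods), figsize=(9, 3))
for ax, method in zip(axes, methods):
    ax.imshow(board, cmap="gray", interpolation=method)
    ax.set_title(method)
    ax.axis("off")
```

With "nearest" the squares keep sharp edges, while the other two methods blur the borders between cells.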
Pandas is not just a cute name (it derives from the term "panel data", which in econometrics means multidimensional data involving measurements over time) but also a robust library for creating data structures and for data manipulation and analysis.
Some of the highlights of Pandas are:
- Labeled array data structures, some of which are Series (one-dimensional array) and DataFrame (two-dimensional array)
- Tools for loading data into in-memory data objects from different file formats (CSV and other delimited files, Excel 2003, and also PyTables/HDF5)
- Label-based slicing, indexing, and subsetting of large data sets
- A group-by engine that allows aggregation and transformations
- Highly optimized performance
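Two of these highlights, the group-by engine and label-based indexing, can be sketched in a few lines. The sales table below is invented purely for illustration:

```python
import pandas as pd

# Hypothetical sales records, made up for this example
sales = pd.DataFrame(
    {
        "city": ["Cluj", "Cluj", "Iasi", "Iasi"],
        "product": ["A", "B", "A", "B"],
        "units": [10, 5, 7, 3],
    }
)

# Group-by engine: total units sold per city
units_per_city = sales.groupby("city")["units"].sum()
print(units_per_city)

# Label-based indexing: index the Cluj rows by product, then look up by label
cluj = sales[sales["city"] == "Cluj"].set_index("product")
units_cluj_b = cluj.loc["B", "units"]
print(units_cluj_b)
```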
Here is an example of using the DataFrame object and then viewing its content and index.
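A small sketch of such an example follows. The dates, column names, and values are placeholders I picked, not data from a real source:

```python
import numpy as np
import pandas as pd

# A date index with four consecutive days (arbitrary start date)
dates = pd.date_range("2024-01-01", periods=4, freq="D")

# A 4x3 table of the integers 0..11, labeled by date and column name
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=dates,
    columns=["a", "b", "c"],
)

print(df)          # the content of the DataFrame
print(df.index)    # the row labels (a DatetimeIndex)
print(df.columns)  # the column labels
```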
SciPy is another core package for scientific computing, built on NumPy and using its array data type and other functionalities. It contains modules for linear algebra, optimization, integration, and statistics, and it also adds a collection of algorithms and high-level commands for manipulating and visualizing data. All these modules depend on NumPy, which means that NumPy is imported as well when working with SciPy, but the SciPy modules are independent of each other.
Remember calculus? Yeah, me neither. Well, SciPy has some fun modules that let you compute all those things that gave you headaches in high school. Let’s take, for example, integration. Here is the most basic integration of x^2 on the interval [0, 4]:
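For reference, this integral can be worked out by hand, which gives the exact value that the numerical estimates should approach:

```latex
\int_{0}^{4} x^{2}\,dx = \left[\frac{x^{3}}{3}\right]_{0}^{4} = \frac{64}{3} \approx 21.33
```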
Now, SciPy has different ways to solve this, using functions from the scipy.integrate sub-package, some of which are quad() and simps(). Each of these functions uses a different integration technique, but I will not go into the details, as that would require some heavy knowledge of numerical analysis, which is not the point of this article.
The first example:
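A sketch of how quad() can be called for this integral is shown here. The helper name f is mine; quad() only needs a callable and the two integration limits:

```python
from scipy.integrate import quad

def f(x):
    """The integrand: x squared."""
    return x ** 2

# Integrate f from 0 to 4
result = quad(f, 0, 4)
estimate, abs_error = result

print(result)
```

The estimate should be very close to the exact value 64/3.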
The return value is a tuple, where the first element is the estimated value of the integral and the second element is the estimated absolute error.
The second example uses the simps() function to estimate the integral of x^2 from arbitrarily spaced samples. If the samples are not equally spaced, then the result is exact only if the function is a polynomial of order 2 or less. That means that if I replaced x^2 with x^3, the estimate would not be as precise, because the order of the polynomial x^3 is larger than two.
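Here is a sketch of that comparison. Note that in recent SciPy releases the function is called simpson(), and as far as I know the older simps() name has been deprecated and later removed, so the import below tries the new name first. The sample points are arbitrary, unequally spaced values I picked:

```python
import numpy as np

try:
    from scipy.integrate import simpson            # newer SciPy releases
except ImportError:
    from scipy.integrate import simps as simpson   # older SciPy releases

# Five unequally spaced sample points on [0, 4] (chosen arbitrarily)
x = np.array([0.0, 0.5, 2.0, 3.0, 4.0])

# Estimate the integrals from the samples alone
area_sq = simpson(x ** 2, x=x)     # exact answer: 64/3
area_cube = simpson(x ** 3, x=x)   # exact answer: 64

print(area_sq, area_cube)
```

The first estimate matches 64/3 up to rounding, while the second one drifts away from 64, as the paragraph above predicts.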
My favorite library from this list has to be Bokeh, because data visualization is my field of expertise. Bokeh is a great visualization tool that provides beautifully constructed graphs with high interactivity on large or streaming data sets. You can easily create interactive plots, dashboards, and data applications with Bokeh. Other advantages are the different output options and the various customization options.
Notable similar libraries are Seaborn and, of course, Matplotlib mentioned above.
In the Matplotlib use case, I created the sine and cosine graphs. This time I will make them a little fancier with Bokeh.
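A minimal sketch of the Bokeh version follows. The title, colors, tool list, and legend behavior are my own choices. The final show() call is left as a comment so the cell can run without opening a browser tab:

```python
import numpy as np
from bokeh.plotting import figure

x = np.linspace(-np.pi, np.pi, 256)

# A figure with a few interactive tools enabled in the toolbar
p = figure(
    title="Sine and cosine",
    x_axis_label="x",
    y_axis_label="y",
    tools="pan,wheel_zoom,box_zoom,reset,save",
)
p.line(x, np.sin(x), legend_label="sin(x)", line_width=2, color="navy")
p.line(x, np.cos(x), legend_label="cos(x)", line_width=2, color="firebrick")

# Clicking a legend entry hides or shows the corresponding line
p.legend.click_policy = "hide"

# To display the plot:
# from bokeh.plotting import show, output_notebook
# output_notebook()   # only needed inside a notebook
# show(p)
```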
Most of the libraries presented here are used for working with data sets that you already have. But what if your data has to be extracted from documents or web pages? Scrapy is a web crawling framework used to create spider bots that continuously crawl the web and extract structured data like prices, contact info, and URLs. Originally designed for web scraping, it can also be used to extract data from APIs.
If you are using Anaconda, then you will have to install this package separately as it is not part of the distribution.
Other data mining and NLP Python libraries are NLTK and Pattern, which I will not present in this article.
Basic usage: create a Python spider – a class that contains the extraction logic for a website. Then run the spider from the command line.
I am going to create a basic script that extracts the main department names from the well-known emag.ro. I will not go over the installation and creation of a project, as this information is easily accessible on the internet and is beyond the scope of this article. But you should know that all the files that we need or create are stored under the folder of the project. Also, I can use either the command-line scrapy tool or, with some tweaking, the Jupyter Notebook to exemplify this library. I will be using the latter in order to be consistent with the rest of the article.
I will start by creating the “item” – the data that you want to get from crawling the domain emag.ro:
After doing that, I will create the spider that controls the “crawl” action. Looking at the page source, I see that I need to look for the text “megamenu-list-department__department-name” of the <span> tag.
Finally, let the spider do its job:
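If I were using the command line, I would issue the following command from the project folder, with the name of the spider and of the output file filled in:
$ scrapy crawl SPIDERNAME -o OUTPUTFILE.json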
The extracted data can be nicely spooled to a JSON file. You can further mine this data and visualize it in a pleasant manner.
This list would not be complete without Scikit-learn, which is arguably the most popular machine learning library. It builds on SciPy and NumPy and adds a set of supervised and unsupervised learning algorithms for common machine learning and data mining tasks, including clustering, regression, and classification. Some of these algorithms are SVM, random forests, k-means, and DBSCAN. This library is focused on modeling data, not on loading, manipulating, and summarizing it, which is what Pandas and NumPy are for.
Machine learning algorithms are a complex subject, and Scikit-learn has a steep learning curve, so even a short tutorial can span several pages and get quite complicated. Nevertheless, I will try to give here just a basic linear regression example: the common case of fitting a line to (x, y) data.
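A minimal sketch of such a fit is below. The data is synthetic: I generate points around the line y = 2x + 1 with a little random noise, so the true slope and intercept are known in advance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic (x, y) data scattered around y = 2x + 1 (made up for this example)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Scikit-learn expects a 2-D feature matrix, hence the reshape
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# Predict y for a new x value
y_pred = model.predict(np.array([[12.0]]))
print(y_pred)
```

The fitted slope and intercept should land close to 2 and 1, and the same fit/predict pattern applies to most other Scikit-learn estimators.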
Trends change and Python continuously evolves so the list presented in this article might change over time. But for now, if you either want to learn more about data science or are looking into technologies for an upcoming data project, (hopefully) this article could be a good start.