11 August 2023
Python has become one of the most popular programming languages for data analysis and data science. Its versatility, large collection of powerful libraries, and ease of use make it a great choice for analysing data and building machine learning models. This comprehensive guide will teach you how to set up your Python environment, learn core programming concepts, analyse and visualise data with popular libraries like Pandas and Matplotlib, and take your data analysis skills to the next level.
That popularity is why so many companies opt to place staff on a data analyst course focused specifically on data analysis skills.
To start analysing data in Python, you first need to set up your working environment. The easiest way to get started is to install a Python distribution like Anaconda that comes bundled with the most popular data science libraries.
Anaconda is an open source Python distribution created by Continuum Analytics (now Anaconda, Inc.). It includes Python, common scientific computing packages like NumPy and SciPy, data science tools like Jupyter Notebook, and more than 250 other Python packages for data science. Anaconda makes it easy to get started with Python data analysis without having to deal with environment configuration and dependency conflicts.
Follow these steps to install Anaconda on your operating system:
Open a terminal and run conda list to verify it works.
Launch Jupyter Notebook by running jupyter notebook on the command line. This will open up a browser window with the Notebook dashboard, allowing us to start coding in Python!
With Anaconda and Jupyter Notebook installed, you now have an excellent environment for data analysis in Python!
The major libraries we'll use for data analysis are:
We can import these libraries right in a Jupyter Notebook cell:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
This imports NumPy as np, Pandas as pd, and Matplotlib's Pyplot module as plt for easy usage.
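As a quick sanity check (a minimal sketch, not one of the original setup steps), each of these libraries exposes a __version__ attribute that you can print to confirm the imports succeeded. The version numbers you see depend on your own installation.

```python
import sys

import matplotlib
import numpy as np
import pandas as pd

# Collect the installed versions; the values depend on your environment.
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "matplotlib": matplotlib.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any import fails with ModuleNotFoundError, the package is missing from the active environment.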
Before diving into data analysis, it's important to learn some core Python programming concepts. Python makes it easy to get started with its readable syntax, concise code, and robust standard library.
Python uses simple, English-like syntax with indented code blocks representing scope. Here are some key syntax elements:
# denotes comments
() call functions and enclose tuples
[] access array elements and denote lists
{} contain dictionaries and sets
= assigns values to variables
You can store data in Python variables like so:
name = "John"
age = 25
weights = [150, 152, 148] #list
Common Python data types include:
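As an illustration, the built-in type() function reports the type of any value. The sketch below runs it over a handful of built-in types. The sample values are made up for this example, and the selection is not necessarily exhaustive.

```python
# Illustrative values only; the dictionary keys are the expected type names.
examples = {
    "int": 25,
    "float": 72.5,
    "str": "John",
    "bool": True,
    "list": [150, 152, 148],
    "tuple": (1, 2),
    "dict": {"name": "John", "age": 25},
    "set": {1, 2, 3},
}

for label, value in examples.items():
    # type(value).__name__ gives the type's name as a string
    print(label, "->", type(value).__name__)
```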
Control flow statements allow you to iterate over sequences and make comparisons:
For Loops
for num in [1, 2, 3]:
    print(num)
While Loops
count = 0
while count < 5:
    print(count)
    count += 1
If Statements
age = 20
if age >= 18:
    print("You are eligible to vote")
else:
    print("You are not eligible to vote yet")
Functions group code you want to reuse. Here's a basic function that prints a greeting:
def greet(name):
    print("Hello " + name)

greet("Bob") #call function
You can return values from functions like so:
def square(x):
    return x * x

result = square(5)
print(result) #25
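To see how these pieces fit together, here is a small sketch that combines a function, a for loop, and an if statement. The average() helper is a hypothetical name introduced for this example. In practice you would more likely use the standard library's statistics.mean.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for value in values:  # the for loop accumulates the sum
        total += value
    return total / len(values)


weights = [150, 152, 148]  # same list as in the variables example
avg = average(weights)

if avg > 149:  # the if statement branches on the result
    print("Average weight is above 149:", avg)
else:
    print("Average weight is 149 or below:", avg)
```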
With these basic programming concepts, you can now write simple Python scripts!
The Pandas library makes it extremely easy to work with structured data in Python. Learn how to load, view, subset, and manipulate your data for analysis.
Pandas provides two core data structures - Series and DataFrames:
We can create a Series from a list:
import pandas as pd
s = pd.Series([5, 2, -1, 3])
print(s)
#0 5
#1 2
#2 -1
#3 3
And build a DataFrame from a dictionary of lists:
data = {"Name":["John", "Mary"], "Age":[30, 25]}
df = pd.DataFrame(data)
print(df)
# Name Age
#0 John 30
#1 Mary 25
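Once a DataFrame exists, a few attributes and methods give a quick overview of its structure. The sketch below rebuilds the same two-row DataFrame so that it runs on its own.

```python
import pandas as pd

data = {"Name": ["John", "Mary"], "Age": [30, 25]}
df = pd.DataFrame(data)

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # data type of each column
df.info()                # concise summary, printed directly
```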
Pandas makes loading external datasets easy with read_csv(), read_json(), read_excel(), and other I/O functions.
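For example, we can load the Titanic dataset from a CSV file. Pandas does not bundle this dataset, so titanic.csv must already be in your working directory: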
titanic = pd.read_csv("titanic.csv")
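If you do not have a CSV file to hand, the following sketch shows the same read_csv() call on an in-memory string. The rows are invented for illustration and only loosely mimic the Titanic layout. The passengers variable name is also an assumption of this example.

```python
import io

import pandas as pd

# Made-up rows in a Titanic-like layout, used only for illustration.
csv_text = """Name,Sex,Age,Fare
Alice,female,29,71.28
Bob,male,35,8.05
Carol,female,,53.10
"""

# read_csv accepts a file path or any file-like object
passengers = pd.read_csv(io.StringIO(csv_text))

print(passengers.head())
print(passengers.isna().sum())  # count missing values per column
```

Empty fields, such as Carol's age here, are loaded as NaN, which is why checking isna() right after loading is a useful habit.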
Pandas provides nice indexing and slicing syntax for viewing subsets of your data:
df.head() #first 5 rows
df.loc[0:2, ['Name', 'Age']] #rows 0-2, Name and Age columns
df[df['Age'] > 25] #rows with Age > 25
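One detail worth knowing is that loc slices by label and includes the end label, while iloc slices by position and excludes the end position. The sketch below uses a made-up three-row DataFrame to show the difference, along with a filter that combines two conditions.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Mary", "Ann"], "Age": [30, 25, 41]})

by_label = df.loc[0:1, ["Name", "Age"]]  # label-based: includes row 1
by_position = df.iloc[0:1]               # position-based: excludes row 1

# Combine conditions with & and wrap each one in parentheses
older = df[(df["Age"] > 25) & (df["Name"] != "Ann")]

print(len(by_label), len(by_position), len(older))
```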
Common aggregation functions allow you to summarize and process data in Pandas:
titanic['Age'].mean() #average age
titanic.groupby('Sex')['Fare'].sum() #total fare by sex
titanic.sort_values('Fare', ascending=False) #sort by fare, highest first
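When you need several summaries at once, groupby() can be paired with agg(). The following is a minimal sketch on invented passenger records. The output column names mean_age and total_fare are chosen for this example.

```python
import pandas as pd

# Made-up passenger records for illustration.
passengers = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [29, 35, 41, 22],
    "Fare": [71.0, 8.0, 53.0, 10.0],
})

# Named aggregation: new_column=(source_column, function)
summary = passengers.groupby("Sex").agg(
    mean_age=("Age", "mean"),
    total_fare=("Fare", "sum"),
)
print(summary)
```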
As you can see, Pandas gives you extremely powerful, expressive tools for working with structured datasets in Python!
Python's Matplotlib library provides simple yet highly customizable plotting capabilities to help you visualize your data.
The Pyplot module in Matplotlib provides a MATLAB-style interface for building plots:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
plt.plot(x, y) #plot line chart
plt.show()
We can also make scatter plots, bar charts, histograms, and more:
plt.scatter(x, y) #scatter plot
plt.bar(x, y) #bar chart
plt.hist(y) #histogram
Matplotlib makes it easy to customize every element of a plot:
plt.title("Sales Over Time")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.legend(['Store 1', 'Store 2'])
plt.style.use('dark_background') #call before plotting for the style to apply to the figure
This adds a title, axis labels, legend, and custom styling.
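For a complete, runnable picture, here is a sketch that draws two labelled lines and applies the same title and axis labels. It uses Matplotlib's object-oriented interface (fig, ax) rather than the plt functions above, which is a stylistic choice for this example. The revenue figures are made up, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt

months = [1, 2, 3, 4]
store_1 = [2, 4, 6, 8]  # made-up revenue figures
store_2 = [1, 3, 5, 9]

fig, ax = plt.subplots()
ax.plot(months, store_1, label="Store 1")
ax.plot(months, store_2, label="Store 2")
ax.set_title("Sales Over Time")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.legend()  # uses the label= values given to plot()
```

Calling fig.savefig() with a filename of your choice would write the figure to disk. In a notebook the figure is displayed automatically.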
Pandas has built-in time series support that integrates nicely with Matplotlib:
ts = pd.DataFrame(np.random.randn(1000),
                  index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
This plots a random-walk time series indexed by dates.
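A common next step with noisy series like this one is smoothing. The sketch below computes a 30-day moving average with rolling(). The seed and the window length are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so results are repeatable
walk = pd.Series(
    rng.standard_normal(1000).cumsum(),
    index=pd.date_range("2000-01-01", periods=1000),
)

smoothed = walk.rolling(window=30).mean()  # 30-day moving average
print(smoothed.dropna().head())
```

The first 29 values of the smoothed series are NaN because a full 30-day window is not yet available. Calling smoothed.plot() draws the result in the same way as ts.plot().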
Matplotlib empowers you to gain insights from your data through impactful and customizable visualizations.
While Pandas and Matplotlib help you explore and visualize data, NumPy and SciPy provide foundational analytics capabilities, and Scikit-Learn enables advanced machine learning modeling.
NumPy (Numerical Python) provides the core mathematical and statistical operations for arrays in Python:
import numpy as np
data = np.random.randn(10) #random data
data.mean() #mean
data.std() #standard deviation
np.median(data) #median (arrays have no .median() method)
np.percentile(data, 75) #75th percentile
NumPy vectorizes these operations so they are fast and efficient.
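To make "vectorized" concrete, the following sketch squares every element of a small array twice: once with an explicit Python loop and once with a single array expression. The sample values are arbitrary.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element
squared_loop = [v * v for v in values]

# Vectorized version: one expression applied to the whole array
squared_vec = values ** 2

print(squared_vec)
print(np.median(values))
```

Both versions give the same numbers. On large arrays the vectorized form is typically much faster because the loop runs in compiled code.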
The Scikit-Learn library contains many commonly used machine learning algorithms:
from sklearn.linear_model import LinearRegression
X = df[['Age']] #predictors (assumes df has an Age column)
y = df['Sales'] #target (assumes df has a Sales column)
model = LinearRegression()
model.fit(X, y)
model.predict([[30]]) #predict sales for 30 year old
Here we built a simple linear regression model to predict sales based on age. Scikit-Learn makes it easy to build and evaluate predictive models.
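Because the DataFrame used earlier has no Sales column, here is a self-contained sketch of the same workflow on synthetic data. Sales is constructed as exactly 2 * Age + 10, an assumption made purely so the fitted coefficients are easy to check.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: Sales is exactly 2 * Age + 10
df = pd.DataFrame({"Age": [20, 30, 40, 50]})
df["Sales"] = 2 * df["Age"] + 10

X = df[["Age"]]  # predictors must be 2-D, hence the double brackets
y = df["Sales"]  # target

model = LinearRegression().fit(X, y)

# Predict with a DataFrame that uses the same column name as the training data
new_customers = pd.DataFrame({"Age": [35]})
prediction = model.predict(new_customers)

print(model.coef_, model.intercept_, prediction)
```

With noise-free data the model should recover a slope close to 2 and an intercept close to 10. Real data will not fit this cleanly.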
We can evaluate model performance using evaluation metrics:
from sklearn.metrics import mean_squared_error, accuracy_score
y_true = [1, 1, 0, 1]
y_pred = [0, 1, 0, 1]
accuracy_score(y_true, y_pred) #accuracy
mean_squared_error(y_true, y_pred) #MSE
Metrics like accuracy, MSE, precision, recall, etc. are provided to measure model performance.
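As a short sketch of the other metrics mentioned, precision and recall can be computed on the same labels, and a confusion matrix shows the counts behind them.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 0, 1]
y_pred = [0, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
matrix = confusion_matrix(y_true, y_pred)    # rows: actual, columns: predicted

print(precision, recall)
print(matrix)
```

Here there are two true positives, no false positives, and one false negative, so precision is 1.0 and recall is two thirds.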
Here are some next steps for advancing your data analysis skills:
With practice, you'll be well on your way to becoming a data analysis expert in Python!
How long does it take to learn Python for data analysis?
It depends on your prior programming experience, but expect it to take at least 2-3 months of consistent practice to become comfortable with Python data analysis. Focus on getting exposure to key concepts versus trying to master everything.
What math do I need to know for Python data analysis?
Basic high school math covering algebra, statistics, and probability is sufficient. Python libraries like NumPy and Pandas handle most of the advanced math behind the scenes.
What are some good project ideas for learning Python analysis?
Some beginner project ideas: run exploratory data analysis on public datasets (Titanic, Iris flowers), build a simple linear regression model to predict home prices, analyze your own CSV data such as budget expenses in Pandas, or build a dashboard to visualize COVID-19 data using Matplotlib.
What are some other helpful Python data analysis libraries?
Seaborn for statistical data visualization, Scrapy for web scraping, SQLAlchemy for SQL support, StatsModels for statistical modeling, Keras for deep learning. The Python ecosystem has tons of great libraries for data tasks!
Check out our blog for up-to-date information on the latest Python developments, including the articles "Unlocking the Power of AI with Python" and "How many days required to learn Python for data analysis?"
Copyright © 2023 JBI Training. All Rights Reserved.