# Introduction to Pandas

+ Library for data **manipulation**.
    + Select columns.
    + Filter rows based on a condition.
    + Add new columns.
    + Summarise the data: group rows and calculate statistics such as the mean, max or count.
    + Transform columns: numeric to categorical, categorical to indicator/dummy.
+ **Series**:
    + A 1D object with an index.
    + _Like a list of values._
    + Length is fixed when it is created.
    + Elements should be of the same type.
+ **DataFrame**:
    + A 2D object with a row index and column index.
    + _Like a table of values; it has rows and columns._
    + Number of rows is fixed when created; columns can be added and removed.
    + Different columns can be of different types (heterogeneous); within each column, data should be of the same type (homogeneous).

In [78]:
# Import the external library. (It needs to be installed before first use.)

## Series

### Create a Series object

In [12]:
# Create a Series from a list using the Series function. Pandas assigns a default index.

In [1]:
# View the Series. Use the autoprint feature.

In [16]:
# Assign an index manually using the index argument.
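
A minimal sketch of the cells above, assuming pandas is installed and imported under the conventional alias `pd`. The values and the names `s` and `s_labelled` are illustrative only and do not come from the original notebook.

```python
import pandas as pd

# Create a Series from a list; pandas assigns the default index 0, 1, 2, ...
s = pd.Series([10, 20, 30, 40])
print(s)  # in a notebook, a bare `s` on the last line of a cell autoprints

# Assign an index manually using the index argument.
s_labelled = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(s_labelled)
```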

### Attributes (Properties)

In [13]:
# ndim attribute.

In [15]:
# shape attribute.

In [38]:
# size attribute.

In [1]:
# dtype attribute. This is printed in the output when we print a Series.

In [38]:
# index attribute. (For example, for assigning a new index.)
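
The attributes above could be inspected as follows. This is a sketch on made-up data; the exact integer dtype printed (for example `int64`) can depend on the platform.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

print(s.ndim)   # number of dimensions: 1 for a Series
print(s.shape)  # a tuple with one entry: (4,)
print(s.size)   # number of elements: 4
print(s.dtype)  # the single type shared by all elements
print(s.index)  # the default index assigned by pandas

# Assigning to the index attribute replaces the labels; the length must match.
s.index = ["a", "b", "c", "d"]
print(s)
```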

## DataFrame

### Create a DataFrame

In [130]:
# Create a DataFrame using the DataFrame function; pass a dict; the keys become the column names and the values become the column data; pandas assigns the row index.

In [131]:
# Autoprint.

In [129]:
# Manually assign the index.
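
A small sketch of creating a DataFrame from a dict. The column names `Name` and `Marks` are borrowed from a later example in these notes; the rows and the index labels `s1`, `s2`, `s3` are invented for illustration.

```python
import pandas as pd

# Keys become column names; each value supplies the data for that column.
data = {"Name": ["Asha", "Ben", "Chen"], "Marks": [82, 67, 91]}

df = pd.DataFrame(data)  # pandas assigns the row index 0, 1, 2
print(df)

# Manually assign the row index using the index argument.
df_labelled = pd.DataFrame(data, index=["s1", "s2", "s3"])
print(df_labelled)
```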

### Attributes

In [135]:
# ndim.

In [136]:
# shape.

In [137]:
# size.

In [144]:
# dtypes. PLURAL: different columns can have different types.

In [143]:
# index. (For rows.)

In [142]:
# columns. (PLURAL)
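
The same attributes on a DataFrame, sketched on illustrative data. Note that `dtypes` and `columns` are plural, unlike `dtype` on a Series.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ben", "Chen"], "Marks": [82, 67, 91]})

print(df.ndim)     # 2 for a DataFrame
print(df.shape)    # (rows, columns): (3, 2)
print(df.size)     # rows * columns: 6
print(df.dtypes)   # one dtype per column
print(df.index)    # row labels
print(df.columns)  # column labels
```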

## Selecting elements: basic indexing, slicing, Boolean indexing

Recall how indexing and slicing worked on Python lists.

**Problem with index and position for a numeric index.**

In [52]:
# Create a Series with an alphabetical index and select the first element like from a list.

In [40]:
# Repeat the above with an integer index starting at 1. (pandas expects a label.)

In [114]:
# Select a slice. (pandas expects a position.)
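
The label-versus-position ambiguity can be sketched with an integer index starting at 1. The data is made up, and only the single-element case is shown; `.iloc[]` and `.loc[]` are the unambiguous way to handle both single elements and slices.

```python
import pandas as pd

s_int = pd.Series([10, 20, 30], index=[1, 2, 3])

# With an integer index, s_int[1] is read as the LABEL 1, i.e. the first element.
first = s_int[1]
print(first)

# s_int[0] is also read as a label; no label 0 exists, so pandas raises KeyError.
try:
    s_int[0]
    zero_is_missing = False
except KeyError:
    zero_is_missing = True
print(zero_is_missing)
```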

### pandas `.iloc[]` and `.loc[]` attributes

We can tell pandas whether we are specifying the location using the position or label.

+ Use `.iloc[]` to specify the location using an integer; the i stands for integer.
+ Use `.loc[]` to specify the location using the label.

Note: Non-numeric labels must be in quotes to tell Python that we are referring to a label, not a variable with the same name.

In [37]:
# Select the first element by position.

In [42]:
# Select the first element by label.

In [41]:
# Select the second element using the position.
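
A sketch of selecting single elements explicitly by position or by label, using illustrative Series with an integer index and an alphabetical index.

```python
import pandas as pd

s_int = pd.Series([10, 20, 30], index=[1, 2, 3])
s_abc = pd.Series([10, 20, 30], index=["a", "b", "c"])

first_by_position = s_int.iloc[0]   # position 0 is always the first element
first_by_label = s_int.loc[1]       # the label 1 happens to be the first element
second_by_position = s_int.iloc[1]  # position 1 is the second element

first_abc = s_abc.loc["a"]          # non-numeric labels go in quotes

print(first_by_position, first_by_label, second_by_position, first_abc)
```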

**Difference between `.iloc[]` and `.loc[]`**:
+ When using `.loc[]` the `end` label is included in the slice.

In [50]:
# Select a slice using .iloc[].

In [51]:
# Give the labels for the corresponding elements to .loc[].
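
The difference in how the end of a slice is treated can be sketched as follows, on made-up data. Both selections return the same two elements, but the stop position is excluded while the end label is included.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]   # positions 0 and 1; the stop position 2 is excluded
by_label = s.loc["a":"b"]   # labels "a" and "b"; the end label "b" is included

print(by_position)
print(by_label)
```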

### Boolean indexing

+ Select using a Boolean array of the same length (AND same index).
+ The Boolean array is usually generated using a condition. This ensures the length and index are the same.
+ Use the `.isin()` method to match values against a list of values.
+ Pass the Boolean array to the `.loc[]` attribute. (With the `.iloc[]` attribute, pass only the underlying values, for example via `.values`, not the Boolean Series itself.)

Recall the vectorization concept from numpy: arithmetic and comparison operations are performed element by element, either between a vector and a scalar or between two vectors of the same length. (In pandas, the index also matters.)

In [73]:
# Generate a Boolean array using a condition on the Series. Is an element a multiple of ___?
# The condition is applied to each element, resulting in a Boolean array.

In [58]:
# Save the Boolean array (optional, but recommended).

In [74]:
# Select the elements corresponding to labels which have the value True, using .loc[].

In [75]:
# Another example.

In [76]:
# Without saving the Boolean array; code becomes a bit unreadable.

In [229]:
# Use the .isin() method to match values with a given list of values.
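
A sketch of Boolean indexing on a Series. The data is invented, and the divisor 4 is an arbitrary choice for the "multiple of" condition, which the notes leave open.

```python
import pandas as pd

s = pd.Series([3, 8, 12, 7, 20], index=["a", "b", "c", "d", "e"])

# The condition is applied to each element, giving a Boolean Series with the same index.
is_multiple_of_4 = s % 4 == 0
print(is_multiple_of_4)

# Select the elements whose labels map to True.
multiples = s.loc[is_multiple_of_4]

# With .iloc[], pass the underlying Boolean values rather than the Series.
multiples_by_position = s.iloc[is_multiple_of_4.to_numpy()]

# Without saving the Boolean array: shorter, but harder to read.
large = s.loc[s > 7]

# .isin() matches each element against a list of values.
matched = s.loc[s.isin([3, 7, 99])]

print(multiples, large, matched, sep="\n")
```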

**Boolean Indexing on a DataFrame:**

To select rows which meet a condition on a column, think of the operation as two steps:
1. Select the required **column** and place the condition on it to generate a Boolean array.
2. Use the Boolean array to index the **rows** inside `.loc[]`.

In [187]:
# Select rows of a DataFrame.

In [228]:
# Another example.

In [227]:
# Use the .isin() method to match a SINGLE column's values with a given list of values.
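
The two steps above can be sketched on an illustrative DataFrame. The pass mark of 70 and the names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ben", "Chen"], "Marks": [82, 67, 91]})

# Step 1: select the column and place the condition on it.
passed = df["Marks"] >= 70

# Step 2: use the Boolean array to index the rows.
passed_rows = df.loc[passed]
print(passed_rows)

# .isin() on a single column, matched against a list of values.
chosen_rows = df.loc[df["Name"].isin(["Ben", "Chen"])]
print(chosen_rows)
```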

### Combining Boolean arrays using the logical operators

+ We cannot use the logical keywords `and`, `or` and `not` with arrays; these expect a single Boolean value on either side.
+ We can combine multiple conditions using the operators `&` for `and` and `|` for `or`, and negate a Boolean array using `~`.

Recall the concept of Truth Tables covered when learning about conditional statements in Python.

In [177]:
# Combine two arrays using &.

In [147]:
# Combine two arrays using |.

In [148]:
# Negate an array using ~.

In [176]:
# Parenthesize conditions if combining them directly, because the operators & and | (unlike the keywords and/or) have higher precedence than comparison operators.
# 3>2>1
# 2 < 3 and 0 > -1
# 2 < 3 & 0 > -1
# Recall, 0 is treated as False and non-zero numbers are treated as True.
# 3 & 0
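
A sketch of combining Boolean arrays, followed by the plain-Python expressions listed in the last cell. The Series values are illustrative.

```python
import pandas as pd

s = pd.Series([3, 8, 12, 7, 20])

is_even = s % 2 == 0
is_large = s > 10

both = s.loc[is_even & is_large]    # even AND large
either = s.loc[is_even | is_large]  # even OR large
odd = s.loc[~is_even]               # NOT even

# Combining conditions directly: each condition needs its own parentheses.
both_inline = s.loc[(s % 2 == 0) & (s > 10)]

# Plain Python, to see why the parentheses matter.
print(3 > 2 > 1)         # chained comparison: True
print(2 < 3 and 0 > -1)  # the keyword `and` binds last: True
print(2 < 3 & 0 > -1)    # & binds first: 2 < (3 & 0) > -1, i.e. 2 < 0 > -1: False
print(3 & 0)             # 0
```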

## Mathematical and statistical functions (Reductions)

+ `.sum()`, `.mean()`, `.std()`, `.max()`, `.min()`.

### Examples on a Series
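
A minimal sketch of the reductions on a Series, using made-up values. Each method collapses the Series to a single number; note that pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8])

print(s.sum())   # 20
print(s.mean())  # 5.0
print(s.max())   # 8
print(s.min())   # 2
print(s.std())   # sample standard deviation
```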

### Example on a DataFrame

+ Pandas gives **column-wise** results. Recall that numpy gives results for the full array.
+ Each column is considered independently. For example, be careful when interpreting the result of the `max()` method.
+ The presence of non-numeric columns will cause errors when methods like `mean()` or `std()` are used; use `numeric_only=True`.

Note:
+ Notice how the `sum()`, `min()` and `max()` work on strings.

In [197]:
# Create a 3x2 DataFrame with columns Name, Marks.

In [198]:
# min(): interpretation.

In [199]:
# sum(): concatenation of strings.

In [200]:
# mean(): error because of non-numeric columns.

In [204]:
# pd.DataFrame.mean?
# Use numeric_only=True.
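
A sketch of the cells above on an illustrative 3x2 DataFrame. The call to `mean()` without `numeric_only=True` is deliberately left out, since in recent pandas versions it raises an error when a string column is present.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ben", "Chen"], "Marks": [82, 67, 91]})

# min() treats each column independently: the smallest name (alphabetically)
# and the smallest mark need not come from the same row.
col_min = df.min()
print(col_min)

# sum() concatenates the strings in a string column.
col_sum = df.sum()
print(col_sum)

# Restrict mean() to the numeric columns.
col_mean = df.mean(numeric_only=True)
print(col_mean)
```

Here the minimum name is `Asha` while the minimum mark, 67, belongs to `Ben`, which is why the output of `min()` should not be read as a single row.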

### Comparison with numpy

In [213]:
# Create a numpy ndarray.

In [214]:
# sum() is for the full array.

In [219]:
# min() is for the full array.

In [216]:
# Create a DataFrame from the array.

In [220]:
# sum() is column-wise.

In [221]:
# min() is column-wise.
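
The comparison can be sketched as follows, assuming numpy is imported as `np`. The array values and the column names `x` and `y` are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])

# numpy: one result for the full array.
print(arr.sum())  # 21
print(arr.min())  # 1

# pandas: one result per column.
df = pd.DataFrame(arr, columns=["x", "y"])
col_sums = df.sum()
col_mins = df.min()
print(col_sums)
print(col_mins)
```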

### The axis argument

+ Which axis to **collapse**.
+ Use 0 or 'rows' to collapse rows; the function returns column statistics.
+ Use 1 or 'columns' to collapse columns; the function returns row statistics.

In [223]:
# Column statistics for numpy array.

In [224]:
# Row statistics on the DataFrame.

In [225]:
# Another example.
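
A sketch of the axis argument on the same kind of illustrative array and DataFrame. The axis named is the one that disappears from the result.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=["x", "y"])

# Collapse the rows of the numpy array: one statistic per column.
arr_col_sums = arr.sum(axis=0)
print(arr_col_sums)

# Collapse the columns of the DataFrame: one statistic per row.
row_sums = df.sum(axis=1)
print(row_sums)

# The same idea using the string form of the argument.
row_maxes = df.max(axis="columns")
print(row_maxes)
```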

## Sorting data

+ Sorting means arranging the rows in order, not separating them into groups.

In [241]:
# df.sort_values(col). Ascending order by default.
# Does NOT modify the original DataFrame.
# Notice the index after sorting.

In [240]:
# For descending order, add the argument ascending=False.

In [243]:
# Sort by more than one column using a list of cols. By default, ascending by all columns.

In [245]:
# To sort different columns differently, pass a list of Boolean values to the ascending argument;
# it should be the same length as the list of columns being sorted by.
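
A sketch of the sorting cells above. The DataFrame, including the `Section` column, is invented for illustration; marks are kept distinct so the order is unambiguous.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev"],
    "Section": ["B", "A", "B", "A"],
    "Marks": [82, 67, 91, 75],
})

# Ascending by default; a new DataFrame is returned and df itself is unchanged.
# The original index labels travel with their rows.
by_marks = df.sort_values("Marks")
print(by_marks)

# Descending order.
by_marks_desc = df.sort_values("Marks", ascending=False)

# Sort by more than one column: Section first, then Marks within each Section.
by_section_marks = df.sort_values(["Section", "Marks"])

# Section ascending, Marks descending: one Boolean per sort column.
mixed = df.sort_values(["Section", "Marks"], ascending=[True, False])
print(mixed)
```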

## Grouping data

+ We can group data by a **categorical variable** using the `groupby()` method. (A categorical variable is one with a limited number of distinct values.)
+ It returns a new grouped object; the original DataFrame is unchanged.
+ On applying a function like `mean()` or `sum()` to the grouped object:
    + The column grouped on becomes the row index.
    + The summary is calculated for each group on all the numeric variables.
    + Exclude non-numeric columns using `numeric_only=True` when required.
+ To group by a hierarchy of more than one column, pass a list of columns to the `groupby()` method.
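
The points above can be sketched on an illustrative DataFrame. The columns `Section` and `Batch` are hypothetical categorical variables, and the marks are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev"],
    "Section": ["B", "A", "B", "A"],
    "Batch": ["x", "x", "y", "y"],
    "Marks": [82, 67, 91, 75],
})

# Group by one categorical column; df itself is unchanged.
grouped = df.groupby("Section")

# The grouping column becomes the row index of the summary.
# numeric_only=True excludes the string columns Name and Batch.
section_means = grouped.mean(numeric_only=True)
print(section_means)

# Group by a hierarchy of two columns by passing a list.
section_batch_means = df.groupby(["Section", "Batch"]).mean(numeric_only=True)
print(section_batch_means)
```

With two grouping columns, the row index of the result has two levels, one per column in the list.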