Tuesday, July 14, 2020

ML Series: Multiple Linear Regression in Python

Machine Learning 1.02

Last time we looked at simple linear regression in python. In this post we will look at multiple linear regression. In this case we have a real estate pricing dataset where we have house prices, size of the house and the year. Our hypothesis is that size and year can enable us to predict the price of the house! Using the model we will make a prediction about an apartment with size 750 sq.ft. from 2009

1. Step 1 will always be the same as always: Importing the relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression

2. Importing the dataset

data = pd.read_csv('C:/Users/abc/Downloads/real_estate_price_size_year.csv')
data.head()



3. Now will will use 2 independent variables to define our regression expression. Size and year shall be our independent variable and we shall check if the price of a house is dependent on these 2 variables

x = data[['size','year']]
y = data['price']


4. Now we fit the linear regression model
reg = LinearRegression()
reg.fit(x,y)


Note that in simple linear regression we had to convert the x variable into a matrix because sklearn demands an array. But note that Sklearn has been optimized for multiple linear regression and we can simply run just 2 lines of code here.

5. Then we check if our model gives significant results and is a good fit (R-Squared)

reg.score(x,y)

Out: 0.7764803683276793 which is quite significant. 6. We can now start predicting our house prices based on the values(of size and year) reg.predict([[750,2009]]) Out: array([258330.34465995])

Although our machine learning model (multiple linear regression in this case) is completed here. We can go further and perform more checks like f regression to see which of the independent variable is explaining the model more accurately i.e. by creating the univariate p values of the variables. 

Thanks.

ML Series: Simple Linear Regression in Python

Machine Learning 1.01
In this series we are going to look at the basics of machine learning concepts from stats to running a few models and applying on actual data.

Lets first start learning simple linear regression in Python:
Case: We will create a regression which will predict the GPA based on SAT scores obtained by the students.

Sample Data (csv): 

SAT GPA
1714 2.4
1664 2.52
1760 2.54
1685 2.74
1693 2.83
1670 2.91
1764 3
1764 3
1792 3.01
1850 3.01
1735 3.02
1775 3.07

1. Importing the libraries (These are the most common and important libraries which you need to import at most of the times - pandas, numpy, matplotlib, seaborn and linear regression from sklearn package)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
From sklearn.linear_model import LinearRegression
import seaborn as sns
seaborn.set()

2. Loading the data

data = pd.read_csv('C:/Users/abc/Downloads/1.01. Simple linear regression.csv')
data.head()

The data has 2 variable GPA & SAT. The notion is SAT can predict the GPA of a student. We can test our hypothesis and even predict the GPA if in case SAT comes out to be a good predictor of GPA.



3. Setting up the model by defining the dependent variable GPA as y and independent variable as y. Thus we can define the linear regression line as y = b0 + b1*x  where b0 & b1 are the constants

x = data['SAT']
y = data['GPA']
x.shape
y.shape

4. Now the Regression model in Sklearn takes only array as inputs. Thus we will convert the x to 'x_matrix' and see the shape

x_matrix = x.values.reshape(-1,1)
x_matrix.shape

5. The Regression - We need to set a variable to LinearRegression() function and fit the model for our x_matrix & y variables

reg = LinearRegression()
reg.fit(x_matrix,y)

Out: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

6. We can check the results to see if our model is a good fit or not (R-Squared)

reg.score(x_matrix,y)

which gives us a score of 40% which is not so bad! 
It means 40% of the data variability is explained by the model.


7. All Done. Now we can simply predict the GPA for any SAT

reg.predict([[1740]])

Out:  array([3.15593751])

We will discuss about multiple linear regression in the next post.
Thanks

Sunday, July 12, 2020

Time Intelligence in Power BI

Today we are gonna talk about the most important aspects - Time Intelligence. In Fact you will most probably use this script in most of the scenarios or analysis you do. The script we are gonna use today is the most common standard and one of the most handy scripts you should use while starting your analysis.

This script creates a date table in an instant according to the time difference of your choice and even lets you choose the FY beginnings. As in if your financial year starts from July or April you can simply let the script know and it will update the whole table accordingly. 



let CreateDateTable = (StartDate as date, EndDate as date, optional Culture as nullable text) as table =>

let

DayCount = Duration.Days(Duration.From(EndDate - StartDate)),

Source = List.Dates(StartDate,DayCount,#duration(1,0,0,0)),

TableFromList = Table.FromList(Source, Splitter.SplitByNothing()),

ChangedType = Table.TransformColumnTypes(TableFromList,{{"Column1", type date}}),

RenamedColumns = Table.RenameColumns(ChangedType,{{"Column1", "Date"}}),

InsertYear = Table.AddColumn(RenamedColumns, "Year", each Date.Year([Date]), type number),

InsertQuarter = Table.AddColumn(InsertYear, "Quarter Num", each Date.QuarterOfYear([Date]), type number),

InsertCalendarQtr = Table.AddColumn(InsertQuarter, "Quarter Year", each "Q" & Number.ToText([Quarter Num]) & " " & Number.ToText([Year]),type text),

InsertCalendarQtrOrder = Table.AddColumn(InsertCalendarQtr, "Quarter Year Order", each [Year] * 10 + [Quarter Num], type number),

InsertMonth = Table.AddColumn(InsertCalendarQtrOrder, "Month Num", each Date.Month([Date]), type number),

InsertMonthName = Table.AddColumn(InsertMonth, "Month Name", each Date.ToText([Date], "MMMM", Culture), type text),

InsertMonthNameShort = Table.AddColumn(InsertMonthName, "Month Name Short", each Date.ToText([Date], "MMM", Culture), type text),

InsertCalendarMonth = Table.AddColumn(InsertMonthNameShort, "Month Year", each (try(Text.Range([Month Name],0,3)) otherwise [Month Name]) & " " & Number.ToText([Year]), type text),

InsertCalendarMonthOrder = Table.AddColumn(InsertCalendarMonth, "Month Year Order", each [Year] * 100 + [Month Num], type number),

InsertWeek = Table.AddColumn(InsertCalendarMonthOrder, "Week Num", each Date.WeekOfYear([Date]), type number),

InsertCalendarWk = Table.AddColumn(InsertWeek, "Week Year", each "W" & Number.ToText([Week Num]) & " " & Number.ToText([Year]), type text),

InsertCalendarWkOrder = Table.AddColumn(InsertCalendarWk, "Week Year Order", each [Year] * 100 + [Week Num], type number),

InsertWeekEnding = Table.AddColumn(InsertCalendarWkOrder, "Week Ending", each Date.EndOfWeek([Date]), type date),

InsertDay = Table.AddColumn(InsertWeekEnding, "Month Day Num", each Date.Day([Date]), type number),

InsertDayInt = Table.AddColumn(InsertDay, "Date Int", each [Year] * 10000 + [Month Num] * 100 + [Month Day Num], type number),

InsertDayWeek = Table.AddColumn(InsertDayInt, "Day Num Week", each Date.DayOfWeek([Date]) + 1, type number),

InsertDayName = Table.AddColumn(InsertDayWeek, "Day Name", each Date.ToText([Date], "dddd", Culture), type text),

InsertWeekend = Table.AddColumn(InsertDayName, "Weekend", each if [Day Num Week] = 1 then "Y" else if [Day Num Week] = 7 then "Y" else "N", type text),

InsertDayNameShort = Table.AddColumn(InsertWeekend, "Day Name Short", each Date.ToText([Date], "ddd", Culture), type text),

InsertIndex = Table.AddIndexColumn(InsertDayNameShort, "Index", 1, 1),

InsertDayOfYear = Table.AddColumn(InsertIndex, "Day of Year", each Date.DayOfYear([Date]), type number),

InsertCurrentDay = Table.AddColumn(InsertDayOfYear, "Current Day?", each Date.IsInCurrentDay([Date]), type logical),

InsertCurrentWeek = Table.AddColumn(InsertCurrentDay, "Current Week?", each Date.IsInCurrentWeek([Date]), type logical),

InsertCurrentMonth = Table.AddColumn(InsertCurrentWeek, "Current Month?", each Date.IsInCurrentMonth([Date]), type logical),

InsertCurrentQuarter = Table.AddColumn(InsertCurrentMonth, "Current Quarter?", each Date.IsInCurrentQuarter([Date]), type logical),

InsertCurrentYear = Table.AddColumn(InsertCurrentQuarter, "Current Year?", each Date.IsInCurrentYear([Date]), type logical),

InsertCompletedDay = Table.AddColumn(InsertCurrentYear, "Completed Days", each if DateTime.Date(DateTime.LocalNow()) > [Date] then "Y" else "N", type text),

InsertCompletedWeek = Table.AddColumn(InsertCompletedDay, "Completed Weeks", each if (Date.Year(DateTime.Date(DateTime.LocalNow())) > Date.Year([Date])) then "Y" else if (Date.Year(DateTime.Date(DateTime.LocalNow())) < Date.Year([Date])) then "N" else if (Date.WeekOfYear(DateTime.Date(DateTime.LocalNow())) > Date.WeekOfYear([Date])) then "Y" else "N", type text),

InsertCompletedMonth = Table.AddColumn(InsertCompletedWeek, "Completed Months", each if (Date.Year(DateTime.Date(DateTime.LocalNow())) > Date.Year([Date])) then "Y" else if (Date.Year(DateTime.Date(DateTime.LocalNow())) < Date.Year([Date])) then "N" else if (Date.Month(DateTime.Date(DateTime.LocalNow())) > Date.Month([Date])) then "Y" else "N", type text),

InsertCompletedQuarter = Table.AddColumn(InsertCompletedMonth, "Completed Quarters", each if (Date.Year(DateTime.Date(DateTime.LocalNow())) > Date.Year([Date])) then "Y" else if (Date.Year(DateTime.Date(DateTime.LocalNow())) < Date.Year([Date])) then "N" else if (Date.QuarterOfYear (DateTime.Date(DateTime.LocalNow()))> Date.QuarterOfYear([Date])) then "Y" else "N", type text),

InsertCompletedYear = Table.AddColumn(InsertCompletedQuarter, "Completed Years", each if (Date.Year(DateTime.Date(DateTime.LocalNow())) > Date.Year([Date])) then "Y" else "N", type text)

in

InsertCompletedYear

in

CreateDateTable



 This will give you an option to choose your starting and ending dates and also to choose your 
FY start period.

Then you can easily see your whole table created and you can use. 
Try it at your end post any queries you have!

Sunday, July 5, 2020

SUMMARIZE AND SUMMARIZECOLUMNS in Power BI

Today we are going to see SUMMARIZE, ADDCOLUMNS and SUMMARIZECOLUMNS:

Extension columns are columns that you add to existing tables. You can obtain extension columns by 
using both ADDCOLUMNS and SUMMARIZE. Basically these functions adds columns / tables over the conditions applied.

For example, the following query adds a Year Production column to the rows returned from the Product table.

EVALUATE
ADDCOLUMNS(
    Product,
    "Year Production", YEAR( Product[Product Start Date] )
)

In below query, you can count the number of products for each product category by using the following query

EVALUATE
SUMMARIZE(
    Product,
    Product[Product Category Name],
    "Products", COUNTROWS( Product )
)


For performance reasons it is better to add ADDCOLUMNS() outside. 

EVALUATE
ADDCOLUMNS(
    SUMMARIZE(
        Product,
        Product[Product Name],
        Product[Product Start Date]
    ),
    "Year Production", YEAR( Product[Product Start Date] )
)
SUMMARIZECOLUMNSReturns a summary table over a set of groups.

Below query returns the sumofsales for calendar year and category name
Summarize =
SUMMARIZECOLUMNS(
    'Calendar'[Year],
    Categories[CategoryName],
    "sumofsales", [Sales])


Thanks, :)

Saturday, July 4, 2020

Some Cool DAX inbuilt functions

Today lets learn about some important and powerful DAX functions which will help you a lot in day to day calculations:

1. CALCULATE - Calculate enables you to do a calculation having any condition applied

For e.g. below measure, calculates YTD Sales for the same period last year period

PY Sales : CALCULATE([YTD Sales],SAMEPERIODLASTYEAR(DimDate[Date]))
Below will do calculation with all product names ignoring any filters: Total Sales : CALCULATE([Total Sales], All(products[productname])) 2. Similarly SAMEPERIODLAST is an intitutive function which lets you choose the exact same period as you see this year

Basically, it returns a table that contains a column of dates shifted one year back in time from the dates in the specified dates column, in the current context.

So in the given measure, before the CALCULATE Function does the calculation,  SAMEPERIODLASTYEAR returns the previous year date from DimDate table. Then Calculation for YTD Sales is done for that year.
PY Sales : CALCULATE([YTD Sales],SAMEPERIODLASTYEAR(DimDate[Date]))
3. ISFILTERED() - can check whether a column is being filtered directly or if any of the columns of the table is being filtered directly.
In the below measure, ISFILTERED() will check if there is a value selected/filtered on the year slicer, the calculation is done, else the string is returned. PY sales no year selected wtext = IF(ISFILTERED(DimDate[Year]),     CALCULATE([YTD Sales],SAMEPERIODLASTYEAR(DimDate[Date])),"Select a year in slicer")
4. HASONEVALUE() - Returns true when there’s only one value in the specified column.

5. FIRSTNONBLANK() and LASTNONBLANK()Returns the first and last value in the column for which the expression has a non blank value respectively.

Calculates the first date which has non blank sales value.
firstnonblank 1 = FIRSTNONBLANK(Sales[Date],SUM(Sales[Sales]))

The below measure calculates, first non blank month which has total sales value

Firstnonblank month = FIRSTNONBLANK(Sales[Month], [Tot sales])


The difference between FIRSTNONBLANK() and FIRSTNONBLANKVALUE()
FIRSTNONBLANK() returns first non blank value of a column, while FIRSTNONBLANKVALUE() returns the first non blank value of an expression. You have to be careful to sort the data on right column and check results.

E.g. gives the first non blank month having the sales 
Firstnonblankvalue year = FIRSTNONBLANKVALUE(Sales[Month], SUM(Sales[Sales]))
Thanks :)

Using aggregate DAX functions

Today we are going to learn some of the most important DAX functions such as SUM, SUMX


1. You can calculate the Total Sales using a normal multiplication (unit price and quantity) and creating a column in the data.

TotSalesColumn = Order_Details[UnitPrice]*(1-Order_Details[Discount])* Order_Details[Quantity]


2. Now you can use SUM function and calculate the total sales
Total Sales = SUM(Order_Details[TotSalesColumn])

Using SUM and SUMX will give the same amount finally, while you have to add one step extra to calculate the total of TotSalesColumn. 
SUMX can directly calculate the Total Sales using Unit price and quantity of order details

3. Total Sales X: SUMX 

Total Sales X = SUMX(Order_Details,Order_Details[UnitPrice]*(1-Order_Details[Discount])* Order_Details[Quantity])




Using most common aggregate functions:

1. MinSale = MIN(EmployeeSales[Sales] 2. MaxSale = MAX(EmployeeSales[Sales]) 3. CountSales = COUNT(EmployeeSales[NewSale])

Finds the standard deviance of selected records
4. StdevPIncome = STDEV.P(EmployeeSales[YearlyIncome]) Finds the statistical variance of selected or some records 5. VarPIncome = VAR.P(EmployeeSales[YearlyIncome]) To find the statistical Variance of complete records 6. VarSIncome = VAR.S(EmployeeSales[YearlyIncome]) Thanks,