#Extract the yearly standard deviation from a stock
#Go to the link below to extract the data
#https://finance.yahoo.com/quote/FB/history?p=FB&.tsrc=fin-tre-srch
#After that, apply this formula to the whole H column: =(C1/B1 - 1) * 100 and
#this one to the I column: =(1 - D1/B1) * 100
#The reason is to determine the percentage variation. The next project is to automate this with Python
#(see the pandas sketch after this code). Your data should look like the image below.
#facebook_2019
import pandas as pd #import libraries
import matplotlib.pyplot as plt
#The code below follows the same pattern used for other years.
#Before opening the file, erase the header row (e.g. Date, Open, ...) as it could
#cause problems when extracting the data. I will come back to reduce this manual job, hopefully with code.
#Proceed to the path of your csv (comma separated values) file
path = "FB_2019.csv" # This code redirects to path of file,
df = pd.read_csv(path, header=None)
headers= ['date','open','High','low','close','Adj Close','Volume', 'range_high', 'range_low']
df.columns = headers
data2019 = df[['date','open','High','low','close','Adj Close','Volume','range_high', 'range_low']].describe()
#In the code above, replace 'date' or 'open' with 'low' or any other column name in headers
print('data2019')
print(data2019)
year_of_selection = df[['open']]
import matplotlib.pyplot as plt
plt.plot(year_of_selection)
plt.ylabel('prices 2019')
plt.show()
#You can choose any column you want by modifying data2019
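As a follow-up to the note above about automating the Excel step with Python, here is a minimal sketch of how the two formulas could be computed in pandas instead. It assumes the csv is the raw Yahoo Finance download with only the seven standard columns (before the two Excel columns are added), so adjust the file name if yours already has the extra columns.
#Hedged sketch: compute the percentage-variation columns in pandas
raw = pd.read_csv("FB_2019.csv", header=None)
raw.columns = ['date', 'open', 'High', 'low', 'close', 'Adj Close', 'Volume']
raw['range_high'] = (raw['High'] / raw['open'] - 1) * 100   #Excel: =(C1/B1 - 1) * 100
raw['range_low'] = (1 - raw['low'] / raw['open']) * 100     #Excel: =(1 - D1/B1) * 100
print(raw[['range_high', 'range_low']].describe())          #the 'std' row is the standard deviation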
Thursday, July 25, 2019
Friday, May 31, 2019
Regression model of highway-mpg and price
#The code below is a linear regression between highway miles per gallon (mpg) and vehicle price.
#Just as a reminder, a # (hash) marks a comment, so the code below is explained with # comments.
#import libraries
import pandas as pd #useful to open csv comma separated values documents
import numpy as np #useful to classify data
import matplotlib.pyplot as plt #useful to graph
import seaborn as sns #useful to draw statistical data https://seaborn.pydata.org/
# path of data
path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
#Code below reads and organizes the csv file with the panda library
df = pd.read_csv(path)
df.head()
# Create object
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm
X = df[['highway-mpg']] #double brackets create a DataFrame (the index plus the highway-mpg column)
Y = df['price']
#width and height of the figure, in inches
width = 12
height = 5
#code below graphs the data from the csv with matplotlib as plt and seaborn as sns
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
correlation = df[["peak-rpm","highway-mpg","price"]].corr()
print(correlation)
print('This is the correlation between peak-rpm, highway-mpg and price')
#We can conclude that the slope is negative. The prediction would fall along that line, within the range of the MSE (see the fitting sketch after this code).
#Below is the residual
#The residual helps you to determine the accuracy of the predictor
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()
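The comments above mention the slope and the MSE, but the lm object was never actually fitted. Below is a minimal sketch of that step, using the X and Y already defined (the metric import is from scikit-learn).
#Hedged sketch: fit the model defined above and inspect the slope and MSE
from sklearn.metrics import mean_squared_error
lm.fit(X, Y)                                  #fit price as a function of highway-mpg
print('slope:', lm.coef_[0])                  #expected to be negative
print('intercept:', lm.intercept_)
Yhat = lm.predict(X)                          #predictions along the regression line
print('MSE:', mean_squared_error(Y, Yhat))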
Tuesday, May 28, 2019
Ecuador Oil revenue
The other data for the revenue estimates were obtained from https://contenido.bce.fin.ec/documentos/Estadisticas/Hidrocarburos/ASP201712.pdf, an official government source.
An average daily production figure was used because daily data was not available (or could not be found) for this analysis.
The average daily production was set at 543,095.9 oil barrels, which was multiplied by the price of WTI oil.
Analyses were done for 2015, 2016, 2017 and 2018.
The labels in each yearly dataset are as follows: date, price, open, high, low, vol, change.
This method results in a very accurate estimate of the actual revenues, which are shown in the first chart.
This data is very important because oil is one of the main sources of revenue and makes up a large part of the economy. For predictions of this data, please visit the Predicting data post and other topics in this blog.
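The estimation method described above (average daily production multiplied by the WTI price) can be written as a short calculation. The sketch below is only illustrative: the average yearly WTI prices are placeholder values, not the figures behind the charts.
#Hedged sketch of the revenue estimate: daily production x WTI price x days
DAILY_BARRELS = 543095.9                     #average daily production used in this post

def yearly_revenue(avg_wti_price, days=365):
    return DAILY_BARRELS * avg_wti_price * days

#placeholder WTI averages; replace them with the real yearly averages
for year, price in [(2015, 48.7), (2016, 43.3), (2017, 50.8), (2018, 64.9)]:
    print(year, round(yearly_revenue(price) / 1e9, 2), 'billion USD (approx.)')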
Saturday, May 25, 2019
Getting and Drawing data
Getting Data
I have really loved data from an early age. I remember the time my dad took me to a store where I found a
world almanac, back in 2005. That was a whole discovery. It introduced me to the question of where to get the best data. An almanac relies on official sources, but where is the best place to find the best data? I write this post from Canada, and there are several sources of information. Depending on your needs, you could begin your research with the official statistics of your country.
Begin your data search in official sources like the statistics site of your city, or country.
You could also look academic sources like google scholar, Pubmed and research gate.
Look for available databases of universities as well.
I will write more on this in the future.
Drawing Data
To begin with, you could use Excel to get a better idea. The data below shows a histogram of salaries in Toronto.
To open this file in Python:
>>> file = open('data.csv', 'r') # this opens the 'data.csv' document for reading in Python
>>> file.readline() # this returns the first line of the document
file = open('stats0.csv', 'r')
while input() != 'end': #this loop lets you see how the data is processed by pressing enter
    a = file.readline()
    print(a)
The data is not processed yet.
Read other tutorials to see how to manipulate and prepare the data.
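To draw a histogram like the Excel one directly in Python, here is a minimal sketch. It assumes stats0.csv has a header row and a numeric column; the column name 'salary' is hypothetical, so replace it with the real one.
#Hedged sketch: read the csv with pandas and draw a histogram with matplotlib
import pandas as pd
import matplotlib.pyplot as plt
salaries = pd.read_csv('stats0.csv')    #assumes a header row in the file
plt.hist(salaries['salary'], bins=20)   #'salary' is a hypothetical column name
plt.xlabel('salary')
plt.ylabel('count')
plt.show()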
Tuesday, May 21, 2019
Extract data from stocks yearly and compare to find patterns
The code below is a summary of a yearly stock analysis.
You can extract the data from Yahoo Finance.
(Go to https://help.yahoo.com/kb/SLN2311.html for more info.)
Then put the data in the same folder as your code.
Download the data for two years (e.g. 2018, 2019) and erase the header row (e.g. Date, Open, Close)
so the data is easy to process.
#datascience_0001
import pandas as pd                  #import libraries
import matplotlib.pyplot as plt
path = "amazon2017.csv"              #define the location of your document on the computer
df = pd.read_csv(path, header=None)  #the header row was erased to avoid problems, so add the names you decide
headers = ['date','open','High','low','close','Adj Close','Volume']
df.columns = headers
data2017 = df[['date','open','close','Adj Close','Volume']].describe()  #this gives you a summary of each column
print('data2017')
print(data2017)                      #print the summary of data defined in data2017
primero = df[['open']]
plt.plot(primero)
plt.ylabel('prices 2017')
plt.show()
#the code below is for the other year: it is the same as above, only the path to the document changes
path = "amazon2018.csv"
df = pd.read_csv(path, header=None)
headers = ['date','open','High','low','close','Adj Close','Volume']
df.columns = headers
data2018 = df[['date','open','close','Adj Close','Volume']].describe()
#replace 'date' or 'open' with 'low' or any other column name in headers
print(data2018)
a = df[['open']]
plt.plot(a)
plt.ylabel('prices 2018')
plt.show()
#The output of the code is shown below. The 'open' and 'close' prices were chosen, as well as 'Volume'.
#You can choose any column you want by modifying data2018 (a comparison sketch follows below).
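To actually compare the two years on one chart and look for patterns, here is a minimal sketch that builds on the primero (2017 'open') and a (2018 'open') DataFrames defined above. It assumes both files cover a similar number of trading days.
#Hedged sketch: overlay both years' opening prices and compare their spread
plt.plot(primero['open'].values, label='open 2017')
plt.plot(a['open'].values, label='open 2018')
plt.ylabel('opening price')
plt.legend()
plt.show()
print('std 2017:', primero['open'].std(), 'std 2018:', a['open'].std())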
Possible errors
>builtins.FileNotFoundError: File b'amazon2017.csv' does not exist
solution:
Download the data and open it. Save it to the desktop or to the same folder
that you are working in with your Python program.
Thursday, May 16, 2019
Extract data from stocks and graph it
# import libraries
import requests
import time
import http
from bs4 import BeautifulSoup
from datetime import date
import matplotlib.pyplot as plt
import json
jours = []
prix = []
def duplicate_quote(x):
    """Replace single quotes with double quotes so the text can be parsed as JSON.
    >>> duplicate_quote("{'Thu Jan 24 04:28:23 2019': 52.64}")
    '{"Thu Jan 24 04:28:23 2019": 52.64}'
    """
    if "'" in x:
        j = x.replace("'", '"')
        return j
#the two lines below create the file first (if it does not exist yet)
f1 = open('us_cad.txt', 'a')
f1.close()
f1 = open('us_cad.txt', 'r')
texto = f1.read()
#texto
#"{'Thu Jan 24 04:28:23 2019': 52.64}"
#the code below reverses the text and finds the '{', then we take the last dictionary; we subtract 1 to include the last character because of slicing
if texto != '':
    new_dic = texto[-(texto[::-1].find('{')) - 1:]
    dic = json.loads(duplicate_quote(new_dic))
    f1.close()
else:
    dic = {}
    precio = []
    f1.close()
#time.ctime()
#'Thu Jan 24 03:51:29 2019'
today = date.today()
today
days = []
quote_page = 'http://30rates.com/usd-cad-forecast-canadian-dollar-to-us-dollar-forecast-tomorrow-week-month'
# query the website and return the html to the variable ‘page’
page = requests.get(quote_page)
soup = BeautifulSoup(page.content, 'html.parser')
# Find the <strong> tag that holds the quote and get its text
name_box = soup.find('strong')
name = name_box.text
print(name)
precio = []
precio.append(float(name))
days = []
days.append(today.strftime("%d/%m/%y"))
dias = []
dias.append(time.ctime())
#plt.plot(dias, precio)
#plt.plot([1,2,3], [3,4,5])
for i in range(len(dias)):
    dic[dias[i]] = precio[i]
with open('us_cad.txt', 'w') as f:
    f.write(str(dic))
for i in dic:
    jours.append(i)
    prix.append(dic[i])
plt.plot(jours, prix)
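A possible simplification (not what the code above does, just a sketch): writing the dictionary with json.dump and reading it back with json.load would avoid the single-quote replacement step entirely.
#Hedged sketch: persist the price history as real JSON instead of str(dic)
import json
import os

def load_history(path='us_cad.txt'):
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path) as f:
            return json.load(f)
    return {}

def save_history(history, path='us_cad.txt'):
    with open(path, 'w') as f:
        json.dump(history, f)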
Wednesday, May 15, 2019
Generating random drawings with random module
To generate random graphs we first need to think about the range of the random functions.
(If you are new to the random module, read this paragraph; otherwise skip to the next one.)
If you are new to random, start here: go to the Python interpreter and type
>>> import random
If a package is not installed, go to the command prompt in Windows, type python -m pip install followed by the package name,
and then proceed to install the remaining packages.
#install necessary packages
import matplotlib.pyplot as plt
from matplotlib import pyplot
import numpy as np
import random
import time

#this constant determines how many lines you need
NUMERODELINEAS = 4

def crea_graph():
    # Create the vectors X and Y
    plotf = []
    for i in range(NUMERODELINEAS):
        plotf.append('plot%d,' % i)
        print('this is i0 ' + str(i))
    for i in range(10):
        time.sleep(0)
        print('this is i1 ' + str(i))
        a = random.randint(-50, 50)
        b = random.randint(-55, 55)
        print('this is a ' + str(a) + ' and b ' + str(b))
        x = np.arange((a + 1), 10, 0.01)
        print('this is x ' + str(x))
        y = b + np.sin(x)
        #for i in range(4):
        #    print("plot{}".format(i))
        #for i in range(4):
        #    print('plot%d,' % i)
        # generate a random factor to make the sine curve wider (c = random.randrange)
        c = random.randrange(-5, 7)
        print('this is c random ' + str(c))
        for j in range(NUMERODELINEAS):
            print('this is j ' + str(j))
            plotf[j] = plt.plot(a + x * c, c * y)
            time.sleep(0)
        #plot1 = plt.plot(x, y)
        #plot2 = plt.plot(x + 1, y)
        #plot2 = plt.plot(x * 2, y)
    # Show the plot
    plt.ylabel('y axis title')
    plt.show()

crea_graph()
d = random.randint(0, 7)
#loop to keep generating random drawings
for i in range(d):
    crea_graph()
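One small note on the random module: the drawings change on every run because the generator is unseeded. Seeding it (a minimal sketch, not part of the code above) makes a drawing reproducible:
#Hedged sketch: seed the random module so the same drawing can be regenerated
import random
random.seed(42)                         #any fixed integer repeats the same sequence
print(random.randint(-50, 50), random.randint(-55, 55))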