Wednesday, December 20, 2023

Birthday paradox probability calculations using Python

Intuition



We want to know the probability that at least 2 people have their birthday on the same date. 


Then:


A = Event that at least 2 people have their birthday on the same date. 

A' = Event that all people have their birthdays on different dates. 


P(A) = 1 - P(A') 


P(A') = 365/365 * 364/365 * 363/365 * ... * (365 - n + 1)/365   (for n people)



The interpretation is that 365/365 is the probability that the first person has a birthday nobody else has taken. As they are the only person so far, all 365 days are available, so this probability is 100%.

Then the probability that the second person has a different birthday than the first one is 364/365,
or ~0.997260273973.

Then the probability that the third person has a different birthday than both the first and the second is 363/365, or ~0.994520547945, so the probability that all three birthdays differ is 364/365 * 363/365, or ~0.991795834115.

And so on. 

As this product is the probability of A' (the event that all people have their birthdays on different dates), we take 1 - P(A') to get P(A).

P(A) (for 2 people) = 1 - ~0.997260273973 = 0.002739726027397249
P(A) (for 3 people) = 1 - ~0.997260273973 * ~0.994520547945 = 0.008204165884781345
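
The step-by-step reasoning above can be sketched as a running product. This is a minimal illustration, not the only way to compute it: the function name `prob_shared_birthday` and the `days` parameter are names chosen here, and it assumes 365 equally likely birthdays (no leap day).

```python
def prob_shared_birthday(n, days=365):
    """Return P(A): the probability that at least two of n people share a birthday."""
    p_all_different = 1.0  # P(A') for an empty group
    for k in range(n):
        # Person k+1 must avoid the k birthdays that are already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different


for n in (2, 3, 23):
    print(n, prob_shared_birthday(n))
```

Multiplying one fraction at a time avoids the very large factorials, so the same function also works for group sizes well beyond 50.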


Using Python
import math

# For each group size i: P(A') = 365! / (365 - i)! / 365**i, and P(A) = 1 - P(A')
for i in range(50):
    a = math.factorial(365) / math.factorial(365 - i) / 365**i
    print(str(1 - a), "the probability that at least two people have the same birthday with " + str(i) + " people")

0.0 the probability that at least two people have the same birthday with 0 people
0.0 the probability that at least two people have the same birthday with 1 people
0.002739726027397249 the probability that at least two people have the same birthday with 2 people
0.008204165884781345 the probability that at least two people have the same birthday with 3 people
0.016355912466550326 the probability that at least two people have the same birthday with 4 people
0.02713557369979358 the probability that at least two people have the same birthday with 5 people
0.040462483649111536 the probability that at least two people have the same birthday with 6 people
0.056235703095975365 the probability that at least two people have the same birthday with 7 people
0.07433529235166902 the probability that at least two people have the same birthday with 8 people
0.09462383388916673 the probability that at least two people have the same birthday with 9 people
0.11694817771107757 the probability that at least two people have the same birthday with 10 people
0.141141378321733 the probability that at least two people have the same birthday with 11 people
0.16702478883806438 the probability that at least two people have the same birthday with 12 people
0.19441027523242949 the probability that at least two people have the same birthday with 13 people
0.22310251200497289 the probability that at least two people have the same birthday with 14 people
0.25290131976368646 the probability that at least two people have the same birthday with 15 people
0.2836040052528499 the probability that at least two people have the same birthday with 16 people
0.31500766529656066 the probability that at least two people have the same birthday with 17 people
0.34691141787178936 the probability that at least two people have the same birthday with 18 people
0.37911852603153673 the probability that at least two people have the same birthday with 19 people
0.41143838358058005 the probability that at least two people have the same birthday with 20 people
0.4436883351652058 the probability that at least two people have the same birthday with 21 people
0.4756953076625501 the probability that at least two people have the same birthday with 22 people
0.5072972343239854 the probability that at least two people have the same birthday with 23 people
0.5383442579145288 the probability that at least two people have the same birthday with 24 people
0.5686997039694639 the probability that at least two people have the same birthday with 25 people
0.598240820135939 the probability that at least two people have the same birthday with 26 people
0.626859282263242 the probability that at least two people have the same birthday with 27 people
0.6544614723423994 the probability that at least two people have the same birthday with 28 people
0.680968537477777 the probability that at least two people have the same birthday with 29 people
0.7063162427192686 the probability that at least two people have the same birthday with 30 people
0.7304546337286438 the probability that at least two people have the same birthday with 31 people
0.7533475278503207 the probability that at least two people have the same birthday with 32 people
0.774971854175772 the probability that at least two people have the same birthday with 33 people
0.7953168646201543 the probability that at least two people have the same birthday with 34 people
0.8143832388747152 the probability that at least two people have the same birthday with 35 people
0.8321821063798795 the probability that at least two people have the same birthday with 36 people
0.8487340082163846 the probability that at least two people have the same birthday with 37 people
0.8640678210821209 the probability that at least two people have the same birthday with 38 people
0.878219664366722 the probability that at least two people have the same birthday with 39 people
0.891231809817949 the probability that at least two people have the same birthday with 40 people
0.9031516114817354 the probability that at least two people have the same birthday with 41 people
0.9140304715618692 the probability that at least two people have the same birthday with 42 people
0.9239228556561199 the probability that at least two people have the same birthday with 43 people
0.9328853685514263 the probability that at least two people have the same birthday with 44 people
0.940975899465775 the probability that at least two people have the same birthday with 45 people
0.9482528433672547 the probability that at least two people have the same birthday with 46 people
0.9547744028332994 the probability that at least two people have the same birthday with 47 people
0.9605979728794224 the probability that at least two people have the same birthday with 48 people
0.9657796093226765 the probability that at least two people have the same birthday with 49 people



Bonus below:
Rate of change of the probability as more people are added
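
One way to look at this rate of change is to take the difference in probability between consecutive group sizes. The sketch below is only an illustration: the names `prob_shared_birthday`, `increments`, `steepest_n` and `steepest_gain` are chosen here, and it uses `math.perm`, which needs Python 3.8 or later.

```python
import math


def prob_shared_birthday(n, days=365):
    # math.perm(days, n) counts the ways n people can all have different birthdays.
    return 1 - math.perm(days, n) / days**n


# Gain in probability when the group grows from n - 1 to n people.
increments = []
previous = prob_shared_birthday(0)
for n in range(1, 50):
    current = prob_shared_birthday(n)
    increments.append((n, current - previous))
    previous = current

# Group size at which adding one more person raises the probability the most.
steepest_n, steepest_gain = max(increments, key=lambda pair: pair[1])
print(steepest_n, steepest_gain)
```

With the values listed above, the largest single step is from 19 to 20 people (roughly 0.379 to 0.411), after which each additional person adds a little less.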





Sunday, January 15, 2023

Useful commands for R

How to remove or replace a comma or sign from a dataset


y_mod<- gsub("\\,", "", y)

> y

 [1] "$133,172" "$129,201" "$127,575" "$124,679" "$121,280" "$120,390"

 [7] "$118,391" "$117,548" "$116,638" "$116,564" "$116,434" "$116,253"

[13] "$116,252" "$113,536" "$113,325" "$112,851" "$112,813" "$112,746"

[19] "$112,238" "$112,151" "$112,113" "$111,346" "$109,286" "$109,256"

[25] "$108,610" "$106,369" "$106,296" "$105,665" "$104,159" "$103,247"

[31] "$103,087" "$102,923" "$102,736" "$102,461" "$101,968" "$101,620"

[37] "$101,284" "$101,195" "$99,712"  "$99,276"  "$98,776"  "$98,052" 

[43] "$97,389"  "$96,564"  "$94,548"  "$94,428"  "$93,556"  "$93,085" 

[49] "$89,464"  "$84,706" 


Then we use

> gsub("\\,", "", y)

 [1] "$133172" "$129201" "$127575" "$124679" "$121280" "$120390"

 [7] "$118391" "$117548" "$116638" "$116564" "$116434" "$116253"

[13] "$116252" "$113536" "$113325" "$112851" "$112813" "$112746"

[19] "$112238" "$112151" "$112113" "$111346" "$109286" "$109256"

[25] "$108610" "$106369" "$106296" "$105665" "$104159" "$103247"

[31] "$103087" "$102923" "$102736" "$102461" "$101968" "$101620"

[37] "$101284" "$101195" "$99712"  "$99276"  "$98776"  "$98052" 

[43] "$97389"  "$96564"  "$94548"  "$94428"  "$93556"  "$93085" 

[49] "$89464"  "$84706"

> gsub("\\,", "", y)               #this removes the comma (replaces it with an empty string)



> y_mod<- gsub("\\$", "", y_mod)

> y_mod

 [1] "133172" "129201" "127575" "124679" "121280" "120390" "118391"

 [8] "117548" "116638" "116564" "116434" "116253" "116252" "113536"

[15] "113325" "112851" "112813" "112746" "112238" "112151" "112113"

[22] "111346" "109286" "109256" "108610" "106369" "106296" "105665"

[29] "104159" "103247" "103087" "102923" "102736" "102461" "101968"

[36] "101620" "101284" "101195" "99712"  "99276"  "98776"  "98052" 

[43] "97389"  "96564"  "94548"  "94428"  "93556"  "93085"  "89464" 

[50] "84706" 

> y_mod<- gsub("\\$", "", y_mod)             #this removes the dollar sign (replaces it with an empty string)


Strings to numeric

new <- as.numeric(y_mod)

 [1] 133172 129201 127575 124679 121280 120390 118391 117548 116638

[10] 116564 116434 116253 116252 113536 113325 112851 112813 112746

[19] 112238 112151 112113 111346 109286 109256 108610 106369 106296

[28] 105665 104159 103247 103087 102923 102736 102461 101968 101620

[37] 101284 101195  99712  99276  98776  98052  97389  96564  94548

[46]  94428  93556  93085  89464  84706

new <- as.numeric(y_mod)              #this command converts string to numeric
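
For readers who work in Python rather than R, the same cleanup can be sketched as below. This is an illustrative equivalent, not part of the R workflow: the list `y` holds only a few of the values shown above, and the variable names simply mirror the R ones.

```python
# A few of the salary strings from the R vector y shown above
y = ["$133,172", "$129,201", "$99,712", "$84,706"]

# Equivalent of the two gsub() calls: drop the commas, then the dollar signs
y_mod = [value.replace(",", "").replace("$", "") for value in y]

# Equivalent of as.numeric(): convert each cleaned string to a number
new = [int(value) for value in y_mod]
print(new)
```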


How to extract data from the web with Python: data scientist salaries

Salary of data scientists by country and city.


First we have to look for reliable websites. In my case I will use ZipRecruiter for the US and Indeed for Canada.

US

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-DATA-Scientist-Salary-by-State

    
Canada




https://ca.indeed.com/career/data-scientist/salaries

Now, every time these numbers change, we can extract the most recent ones. 

Method 1:

Use Beautiful Soup to save all fields

# Import the libraries needed
import requests
from bs4 import BeautifulSoup


# Set the URL you want to scrape
url = 'https://ca.indeed.com/career/data-scientist/salaries'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <span> element
# (note: the name "all" shadows Python's built-in all())
all = []

for i in soup.find_all("span"):
    all.append(i.text)

print(all)



The code above prints the desired quantities, which are updated every time you run the code. 
From there we can locate only the average:

>>> The average salary is $120,039 
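
Locating the average can be sketched with a regular expression over the collected texts. This is a minimal illustration under assumptions: `span_texts` holds three entries copied from the printed output, and the names `span_texts`, `pattern` and `average_salary` are chosen here. The wording "The average salary is" comes from the page at the time of scraping and may change.

```python
import re

# Three entries copied from the scraped output (stand-in for the full list)
span_texts = [
    "Jobs",
    "\n          $93,500 - $106,99914% of jobs\n        ",
    "\n          The average salary is $120,039 a year$107,000 - $120,49917% of jobs\n        ",
]

# Capture the digits and commas that follow the dollar sign
pattern = re.compile(r"The average salary is \$([\d,]+)")

average_salary = None
for text in span_texts:
    match = pattern.search(text)
    if match:
        # Remove the thousands separator before converting to an integer
        average_salary = int(match.group(1).replace(",", ""))
        break

print(average_salary)
```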





The output below comprises the text of all span elements

>>> print(all)
['Jobs', 'Salaries', 'Messages', 'Sign In', '', 'Post a Job', 'Profile', 'All Salaries', 'DATA Scientist Salary', 'What Is the Average DATA Scientist Salary by State', 'Within 25 miles of Toronto, CA', '\n          $39,500 - $52,9991% of jobs\n        ', '\n          $53,000 - $66,4993% of jobs\n        ', '\n          $66,500 - $79,9997% of jobs\n        ', '', '\n          $92,500 is the 25th percentile. Salaries below this are outliers.$80,000 - $93,49914% of jobs\n        ', '\n          $93,500 - $106,99914% of jobs\n        ', '\n          The average salary is $120,039 a year$107,000 - $120,49917% of jobs\n        ', '\n          $120,500 - $133,99913% of jobs\n        ', '', '\n          $141,000 is the 75th percentile. Salaries above this are outliers.$134,000 - $147,49911% of jobs\n        ', '\n          $147,500 - $160,9997% of jobs\n        ', '', '\n          $169,000 is the 90th percentile. Salaries above this are outliers.$161,000 - $174,4994% of jobs\n        ', '\n          $174,500 - $188,0003% of jobs\n        ', '$39,500', '$120,039\n      /year\n', '/year', '$188,000', 'Data Scientist', 'Altocloud', 'Toronto, ON', 'Data Scientist, Risk', 'Square', 'Toronto, ON', 'Senior Data Scientist, Business Intelligence (English Services)', 'CBC Radio Canada', 'Toronto, ON', 'Associate/Senior Associate, Data Scientist, Portfolio Value Creation', 'CPP Investments', 'Toronto, ON', 'Data Scientist, Consultant', 'Project X', 'Toronto, ON', 'Senior Machine Learning Engineer / Data Scientist', 'Paytm Labs', 'Toronto, ON', 'Data Scientist', 'CARFAX', 'Toronto, ON', 'Data Scientist, MIR (English Services)', 'CBC Radio Canada', 'Toronto, ON', 'Senior Data Scientist', 'Borrowell', 'Toronto, ON', 'Lead Data Scientist (Predictive Maintenance)', 'Fusemachines', 'Toronto, ON', ' in the Toronto, CA area', ' in the Toronto, CA area:', 'ZipRecruiter, Inc. 
© 2023 All Rights Reserved Worldwide', 'New York', '$133,172', '$11,097', '$2,561', '$64.03', 'Idaho', '$129,201', '$10,766', '$2,484', '$62.12', 'California', '$127,575', '$10,631', '$2,453', '$61.33', 'New Hampshire', '$124,679', '$10,389', '$2,397', '$59.94', 'Vermont', '$121,280', '$10,106', '$2,332', '$58.31', 'Maine', '$120,390', '$10,032', '$2,315', '$57.88', 'Massachusetts', '$118,391', '$9,865', '$2,276', '$56.92', 'Hawaii', '$117,548', '$9,795', '$2,260', '$56.51', 'Tennessee', '$116,638', '$9,719', '$2,243', '$56.08', 'Nevada', '$116,564', '$9,713', '$2,241', '$56.04', 'Wyoming', '$116,434', '$9,702', '$2,239', '$55.98', 'Washington', '$116,253', '$9,687', '$2,235', '$55.89', 'Arizona', '$116,252', '$9,687', '$2,235', '$55.89', 'Connecticut', '$113,536', '$9,461', '$2,183', '$54.58', 'Montana', '$113,325', '$9,443', '$2,179', '$54.48', 'Rhode Island', '$112,851', '$9,404', '$2,170', '$54.26', 'Indiana', '$112,813', '$9,401', '$2,169', '$54.24', 'New Jersey', '$112,746', '$9,395', '$2,168', '$54.20', 'Alaska', '$112,238', '$9,353', '$2,158', '$53.96', 'Minnesota', '$112,151', '$9,345', '$2,156', '$53.92', 'West Virginia', '$112,113', '$9,342', '$2,156', '$53.90', 'Oregon', '$111,346', '$9,278', '$2,141', '$53.53', 'Maryland', '$109,286', '$9,107', '$2,101', '$52.54', 'North Dakota', '$109,256', '$9,104', '$2,101', '$52.53', 'Pennsylvania', '$108,610', '$9,050', '$2,088', '$52.22', 'Wisconsin', '$106,369', '$8,864', '$2,045', '$51.14', 'Virginia', '$106,296', '$8,858', '$2,044', '$51.10', 'Ohio', '$105,665', '$8,805', '$2,032', '$50.80', 'Iowa', '$104,159', '$8,679', '$2,003', '$50.08', 'Nebraska', '$103,247', '$8,603', '$1,985', '$49.64', 'South Dakota', '$103,087', '$8,590', '$1,982', '$49.56', 'Colorado', '$102,923', '$8,576', '$1,979', '$49.48', 'Kentucky', '$102,736', '$8,561', '$1,975', '$49.39', 'Delaware', '$102,461', '$8,538', '$1,970', '$49.26', 'Utah', '$101,968', '$8,497', '$1,960', '$49.02', 'Alabama', '$101,620', '$8,468', '$1,954', '$48.86', 
'New Mexico', '$101,284', '$8,440', '$1,947', '$48.69', 'South Carolina', '$101,195', '$8,432', '$1,946', '$48.65', 'Kansas', '$99,712', '$8,309', '$1,917', '$47.94', 'Florida', '$99,276', '$8,273', '$1,909', '$47.73', 'Arkansas', '$98,776', '$8,231', '$1,899', '$47.49', 'Oklahoma', '$98,052', '$8,171', '$1,885', '$47.14', 'Mississippi', '$97,389', '$8,115', '$1,872', '$46.82', 'Michigan', '$96,564', '$8,047', '$1,857', '$46.43', 'Missouri', '$94,548', '$7,879', '$1,818', '$45.46', 'Texas', '$94,428', '$7,869', '$1,815', '$45.40', 'Georgia', '$93,556', '$7,796', '$1,799', '$44.98', 'Illinois', '$93,085', '$7,757', '$1,790', '$44.75', 'Louisiana', '$89,464', '$7,455', '$1,720', '$43.01', 'North Carolina', '$84,706', '$7,058', '$1,628', '$40.72']



You can also try

table = soup.find('table', {'class': 'salary_by_state_table'})


to get the values as an organized table

<table class="salary_by_state_table">
<thead>
<tr>
<th class="col1">State</th>
<th class="col2">Annual Salary</th>
<th class="col3">Monthly Pay</th>
<th class="col4">Weekly Pay</th>
<th class="col5">Hourly Wage</th>
</tr>
</thead>
<tbody>
<tr>
<td class="col1">New York</td>
<td class="col2">$133,172</td>
<td class="col3">$11,097</td>
<td class="col4">$2,561</td>
<td class="col5">$64.03</td>
</tr>
<tr>
<td class="col1">Idaho</td>
<td class="col2">$129,201</td>
<td class="col3">$10,766</td>
<td class="col4">$2,484</td>
<td class="col5">$62.12</td>
</tr>
<tr>
<td class="col1">California</td>
<td class="col2">$127,575</td>
<td class="col3">$10,631</td>
<td class="col4">$2,453</td>
<td class="col5">$61.33</td>
</tr>
<tr>
<td class="col1">New Hampshire</td>
<td class="col2">$124,679</td>
<td class="col3">$10,389</td>
<td class="col4">$2,397</td>
<td class="col5">$59.94</td>
</tr>

.......


From here you can convert the data to CSV. 
The website below can be used to convert an HTML table to CSV:
https://www.convertcsv.com/html-table-to-csv.htm

State,Annual Salary,Monthly Pay,Weekly Pay,Hourly Wage
New York,"$133,172","$11,097","$2,561",$64.03
Idaho,"$129,201","$10,766","$2,484",$62.12
California,"$127,575","$10,631","$2,453",$61.33
New Hampshire,"$124,679","$10,389","$2,397",$59.94
Vermont,"$121,280","$10,106","$2,332",$58.31
Maine,"$120,390","$10,032","$2,315",$57.88
Massachusetts,"$118,391","$9,865","$2,276",$56.92
Hawaii,"$117,548","$9,795","$2,260",$56.51
Tennessee,"$116,638","$9,719","$2,243",$56.08
Nevada,"$116,564","$9,713","$2,241",$56.04
Wyoming,"$116,434","$9,702","$2,239",$55.98
Washington,"$116,253","$9,687","$2,235",$55.89
Arizona,"$116,252","$9,687","$2,235",$55.89
Connecticut,"$113,536","$9,461","$2,183",$54.58
Montana,"$113,325","$9,443","$2,179",$54.48
Rhode Island,"$112,851","$9,404","$2,170",$54.26
Indiana,"$112,813","$9,401","$2,169",$54.24
New Jersey,"$112,746","$9,395","$2,168",$54.20
Alaska,"$112,238","$9,353","$2,158",$53.96
Minnesota,"$112,151","$9,345","$2,156",$53.92
West Virginia,"$112,113","$9,342","$2,156",$53.90
Oregon,"$111,346","$9,278","$2,141",$53.53
Maryland,"$109,286","$9,107","$2,101",$52.54
North Dakota,"$109,256","$9,104","$2,101",$52.53
Pennsylvania,"$108,610","$9,050","$2,088",$52.22
Wisconsin,"$106,369","$8,864","$2,045",$51.14
Virginia,"$106,296","$8,858","$2,044",$51.10
Ohio,"$105,665","$8,805","$2,032",$50.80
Iowa,"$104,159","$8,679","$2,003",$50.08
Nebraska,"$103,247","$8,603","$1,985",$49.64
South Dakota,"$103,087","$8,590","$1,982",$49.56
Colorado,"$102,923","$8,576","$1,979",$49.48
Kentucky,"$102,736","$8,561","$1,975",$49.39
Delaware,"$102,461","$8,538","$1,970",$49.26
Utah,"$101,968","$8,497","$1,960",$49.02
Alabama,"$101,620","$8,468","$1,954",$48.86
New Mexico,"$101,284","$8,440","$1,947",$48.69
South Carolina,"$101,195","$8,432","$1,946",$48.65
Kansas,"$99,712","$8,309","$1,917",$47.94
Florida,"$99,276","$8,273","$1,909",$47.73
Arkansas,"$98,776","$8,231","$1,899",$47.49
Oklahoma,"$98,052","$8,171","$1,885",$47.14
Mississippi,"$97,389","$8,115","$1,872",$46.82
Michigan,"$96,564","$8,047","$1,857",$46.43
Missouri,"$94,548","$7,879","$1,818",$45.46
Texas,"$94,428","$7,869","$1,815",$45.40
Georgia,"$93,556","$7,796","$1,799",$44.98
Illinois,"$93,085","$7,757","$1,790",$44.75
Louisiana,"$89,464","$7,455","$1,720",$43.01
North Carolina,"$84,706","$7,058","$1,628",$40.72
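
The conversion can also be done in Python instead of on a website. The sketch below is one possible approach, not the method used above: it swaps in the standard-library `html.parser` for Beautiful Soup so that it runs without extra installs, the `TableRows` class is a helper written for this example, and the HTML is a shortened two-column copy of the table shown earlier.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened sample of the table (two columns, two states)
html = """
<table class="salary_by_state_table">
<thead><tr><th class="col1">State</th><th class="col2">Annual Salary</th></tr></thead>
<tbody>
<tr><td class="col1">New York</td><td class="col2">$133,172</td></tr>
<tr><td class="col1">Idaho</td><td class="col2">$129,201</td></tr>
</tbody>
</table>
"""

parser = TableRows()
parser.feed(html)

# csv.writer quotes any field that contains a comma, such as "$133,172"
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of a string, the `io.StringIO` buffer can be replaced with a file opened using `open(path, "w", newline="")`.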