Sunday, January 15, 2023

Useful commands for R

How to remove or replace a comma or sign from a dataset


y_mod<- gsub("\\,", "", y)

> y
 [1] "$133,172" "$129,201" "$127,575" "$124,679" "$121,280" "$120,390"
 [7] "$118,391" "$117,548" "$116,638" "$116,564" "$116,434" "$116,253"
[13] "$116,252" "$113,536" "$113,325" "$112,851" "$112,813" "$112,746"
[19] "$112,238" "$112,151" "$112,113" "$111,346" "$109,286" "$109,256"
[25] "$108,610" "$106,369" "$106,296" "$105,665" "$104,159" "$103,247"
[31] "$103,087" "$102,923" "$102,736" "$102,461" "$101,968" "$101,620"
[37] "$101,284" "$101,195" "$99,712"  "$99,276"  "$98,776"  "$98,052"
[43] "$97,389"  "$96,564"  "$94,548"  "$94,428"  "$93,556"  "$93,085"
[49] "$89,464"  "$84,706"


Then we remove the commas:

> gsub("\\,", "", y)
 [1] "$133172" "$129201" "$127575" "$124679" "$121280" "$120390"
 [7] "$118391" "$117548" "$116638" "$116564" "$116434" "$116253"
[13] "$116252" "$113536" "$113325" "$112851" "$112813" "$112746"
[19] "$112238" "$112151" "$112113" "$111346" "$109286" "$109256"
[25] "$108610" "$106369" "$106296" "$105665" "$104159" "$103247"
[31] "$103087" "$102923" "$102736" "$102461" "$101968" "$101620"
[37] "$101284" "$101195" "$99712"  "$99276"  "$98776"  "$98052"
[43] "$97389"  "$96564"  "$94548"  "$94428"  "$93556"  "$93085"
[49] "$89464"  "$84706"

> gsub("\\,", "", y)               # this replaces each comma with an empty string, i.e. removes it



> y_mod <- gsub("\\$", "", y_mod)

> y_mod
 [1] "133172" "129201" "127575" "124679" "121280" "120390" "118391"
 [8] "117548" "116638" "116564" "116434" "116253" "116252" "113536"
[15] "113325" "112851" "112813" "112746" "112238" "112151" "112113"
[22] "111346" "109286" "109256" "108610" "106369" "106296" "105665"
[29] "104159" "103247" "103087" "102923" "102736" "102461" "101968"
[36] "101620" "101284" "101195" "99712"  "99276"  "98776"  "98052"
[43] "97389"  "96564"  "94548"  "94428"  "93556"  "93085"  "89464"
[50] "84706"

> y_mod <- gsub("\\$", "", y_mod)             # this replaces the dollar sign with an empty string, i.e. removes it


Strings to numeric

> new <- as.numeric(y_mod)

> new
 [1] 133172 129201 127575 124679 121280 120390 118391 117548 116638
[10] 116564 116434 116253 116252 113536 113325 112851 112813 112746
[19] 112238 112151 112113 111346 109286 109256 108610 106369 106296
[28] 105665 104159 103247 103087 102923 102736 102461 101968 101620
[37] 101284 101195  99712  99276  98776  98052  97389  96564  94548
[46]  94428  93556  93085  89464  84706

> new <- as.numeric(y_mod)              # this command converts the character strings to numeric values


How to extract data scientist salaries from the web with Python

Data scientist salaries by country and city.


First we have to find reliable websites. In my case I will use ZipRecruiter for the US and Indeed for Canada.

US

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-DATA-Scientist-Salary-by-State

    
Canada

https://ca.indeed.com/career/data-scientist/salaries

These numbers change over time, so every time we run the code we extract the most recent ones.

Method 1:

Use Beautiful Soup to save the text of every span element

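# Import the libraries needed
import requests
from bs4 import BeautifulSoup


# Set the URL you want to scrape.
# The output shown further down matches the ZipRecruiter page (US states);
# swap in the Indeed URL listed above to scrape the Canadian page instead.
url = 'https://www.ziprecruiter.com/Salaries/What-Is-the-Average-DATA-Scientist-Salary-by-State'

# Connect to the URL and stop early if the request failed
response = requests.get(url)
response.raise_for_status()

# Parse the HTML and save it to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <span> element.
# Note: the name `all` shadows Python's built-in all(); it is kept here
# so that it matches the print(all) output shown further down.
all = []


for i in soup.find_all("span"):
    all.append(i.text)

print(all)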



The code above prints the desired quantities, which are refreshed every time you run it.
From there we can pick out just the average:

>>> The average salary is $120,039 
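
One way to locate the average automatically is a regular expression over the span texts. The sketch below is only an illustration: the two sample strings are copied from the output in this post, and the names `spans`, `pattern` and `average` are my own. It assumes the page keeps the phrase "The average salary is", which may change if the site is redesigned.

```python
import re

# Two entries copied from the scraped span list shown in this post.
spans = [
    "\n          $93,500 - $106,99914% of jobs\n        ",
    "\n          The average salary is $120,039 a year$107,000 - $120,49917% of jobs\n        ",
]

# Capture the dollar amount that follows the fixed phrase.
pattern = re.compile(r"The average salary is \$([\d,]+)")

average = None
for text in spans:
    match = pattern.search(text)
    if match:
        # Drop the thousands separator before converting to an integer.
        average = int(match.group(1).replace(",", ""))
        break

print(average)
```

With the live scraper, the list built from the span elements would take the place of the hard-coded `spans`.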





The output below is the text of every span element:

>>> print(all)
['Jobs', 'Salaries', 'Messages', 'Sign In', '', 'Post a Job', 'Profile', 'All Salaries', 'DATA Scientist Salary', 'What Is the Average DATA Scientist Salary by State', 'Within 25 miles of Toronto, CA', '\n          $39,500 - $52,9991% of jobs\n        ', '\n          $53,000 - $66,4993% of jobs\n        ', '\n          $66,500 - $79,9997% of jobs\n        ', '', '\n          $92,500 is the 25th percentile. Salaries below this are outliers.$80,000 - $93,49914% of jobs\n        ', '\n          $93,500 - $106,99914% of jobs\n        ', '\n          The average salary is $120,039 a year$107,000 - $120,49917% of jobs\n        ', '\n          $120,500 - $133,99913% of jobs\n        ', '', '\n          $141,000 is the 75th percentile. Salaries above this are outliers.$134,000 - $147,49911% of jobs\n        ', '\n          $147,500 - $160,9997% of jobs\n        ', '', '\n          $169,000 is the 90th percentile. Salaries above this are outliers.$161,000 - $174,4994% of jobs\n        ', '\n          $174,500 - $188,0003% of jobs\n        ', '$39,500', '$120,039\n      /year\n', '/year', '$188,000', 'Data Scientist', 'Altocloud', 'Toronto, ON', 'Data Scientist, Risk', 'Square', 'Toronto, ON', 'Senior Data Scientist, Business Intelligence (English Services)', 'CBC Radio Canada', 'Toronto, ON', 'Associate/Senior Associate, Data Scientist, Portfolio Value Creation', 'CPP Investments', 'Toronto, ON', 'Data Scientist, Consultant', 'Project X', 'Toronto, ON', 'Senior Machine Learning Engineer / Data Scientist', 'Paytm Labs', 'Toronto, ON', 'Data Scientist', 'CARFAX', 'Toronto, ON', 'Data Scientist, MIR (English Services)', 'CBC Radio Canada', 'Toronto, ON', 'Senior Data Scientist', 'Borrowell', 'Toronto, ON', 'Lead Data Scientist (Predictive Maintenance)', 'Fusemachines', 'Toronto, ON', ' in the Toronto, CA area', ' in the Toronto, CA area:', 'ZipRecruiter, Inc. 
© 2023 All Rights Reserved Worldwide', 'New York', '$133,172', '$11,097', '$2,561', '$64.03', 'Idaho', '$129,201', '$10,766', '$2,484', '$62.12', 'California', '$127,575', '$10,631', '$2,453', '$61.33', 'New Hampshire', '$124,679', '$10,389', '$2,397', '$59.94', 'Vermont', '$121,280', '$10,106', '$2,332', '$58.31', 'Maine', '$120,390', '$10,032', '$2,315', '$57.88', 'Massachusetts', '$118,391', '$9,865', '$2,276', '$56.92', 'Hawaii', '$117,548', '$9,795', '$2,260', '$56.51', 'Tennessee', '$116,638', '$9,719', '$2,243', '$56.08', 'Nevada', '$116,564', '$9,713', '$2,241', '$56.04', 'Wyoming', '$116,434', '$9,702', '$2,239', '$55.98', 'Washington', '$116,253', '$9,687', '$2,235', '$55.89', 'Arizona', '$116,252', '$9,687', '$2,235', '$55.89', 'Connecticut', '$113,536', '$9,461', '$2,183', '$54.58', 'Montana', '$113,325', '$9,443', '$2,179', '$54.48', 'Rhode Island', '$112,851', '$9,404', '$2,170', '$54.26', 'Indiana', '$112,813', '$9,401', '$2,169', '$54.24', 'New Jersey', '$112,746', '$9,395', '$2,168', '$54.20', 'Alaska', '$112,238', '$9,353', '$2,158', '$53.96', 'Minnesota', '$112,151', '$9,345', '$2,156', '$53.92', 'West Virginia', '$112,113', '$9,342', '$2,156', '$53.90', 'Oregon', '$111,346', '$9,278', '$2,141', '$53.53', 'Maryland', '$109,286', '$9,107', '$2,101', '$52.54', 'North Dakota', '$109,256', '$9,104', '$2,101', '$52.53', 'Pennsylvania', '$108,610', '$9,050', '$2,088', '$52.22', 'Wisconsin', '$106,369', '$8,864', '$2,045', '$51.14', 'Virginia', '$106,296', '$8,858', '$2,044', '$51.10', 'Ohio', '$105,665', '$8,805', '$2,032', '$50.80', 'Iowa', '$104,159', '$8,679', '$2,003', '$50.08', 'Nebraska', '$103,247', '$8,603', '$1,985', '$49.64', 'South Dakota', '$103,087', '$8,590', '$1,982', '$49.56', 'Colorado', '$102,923', '$8,576', '$1,979', '$49.48', 'Kentucky', '$102,736', '$8,561', '$1,975', '$49.39', 'Delaware', '$102,461', '$8,538', '$1,970', '$49.26', 'Utah', '$101,968', '$8,497', '$1,960', '$49.02', 'Alabama', '$101,620', '$8,468', '$1,954', '$48.86', 
'New Mexico', '$101,284', '$8,440', '$1,947', '$48.69', 'South Carolina', '$101,195', '$8,432', '$1,946', '$48.65', 'Kansas', '$99,712', '$8,309', '$1,917', '$47.94', 'Florida', '$99,276', '$8,273', '$1,909', '$47.73', 'Arkansas', '$98,776', '$8,231', '$1,899', '$47.49', 'Oklahoma', '$98,052', '$8,171', '$1,885', '$47.14', 'Mississippi', '$97,389', '$8,115', '$1,872', '$46.82', 'Michigan', '$96,564', '$8,047', '$1,857', '$46.43', 'Missouri', '$94,548', '$7,879', '$1,818', '$45.46', 'Texas', '$94,428', '$7,869', '$1,815', '$45.40', 'Georgia', '$93,556', '$7,796', '$1,799', '$44.98', 'Illinois', '$93,085', '$7,757', '$1,790', '$44.75', 'Louisiana', '$89,464', '$7,455', '$1,720', '$43.01', 'North Carolina', '$84,706', '$7,058', '$1,628', '$40.72']



You can also try

table = soup.find('table', {'class': 'salary_by_state_table'})


to get the values as an organized table:

<table class="salary_by_state_table">
<thead>
<tr>
<th class="col1">State</th>
<th class="col2">Annual Salary</th>
<th class="col3">Monthly Pay</th>
<th class="col4">Weekly Pay</th>
<th class="col5">Hourly Wage</th>
</tr>
</thead>
<tbody>
<tr>
<td class="col1">New York</td>
<td class="col2">$133,172</td>
<td class="col3">$11,097</td>
<td class="col4">$2,561</td>
<td class="col5">$64.03</td>
</tr>
<tr>
<td class="col1">Idaho</td>
<td class="col2">$129,201</td>
<td class="col3">$10,766</td>
<td class="col4">$2,484</td>
<td class="col5">$62.12</td>
</tr>
<tr>
<td class="col1">California</td>
<td class="col2">$127,575</td>
<td class="col3">$10,631</td>
<td class="col4">$2,453</td>
<td class="col5">$61.33</td>
</tr>
<tr>
<td class="col1">New Hampshire</td>
<td class="col2">$124,679</td>
<td class="col3">$10,389</td>
<td class="col4">$2,397</td>
<td class="col5">$59.94</td>
</tr>

.......
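
The rows of a table like the one above can also be pulled into Python lists. The sketch below deliberately swaps Beautiful Soup for the standard library's `html.parser`, so that it runs without any installed packages. The class name `TableRows` and the shortened two-column HTML sample are my own illustration, not part of the scraped page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened sample in the same shape as the table printed above.
html = """
<table class="salary_by_state_table">
<thead><tr><th class="col1">State</th><th class="col2">Annual Salary</th></tr></thead>
<tbody>
<tr><td class="col1">New York</td><td class="col2">$133,172</td></tr>
<tr><td class="col1">Idaho</td><td class="col2">$129,201</td></tr>
</tbody>
</table>
"""

parser = TableRows()
parser.feed(html)

for row in parser.rows:
    print(row)
```

With the scraper, `str(table)` should give the HTML text to pass to `parser.feed`.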


From here you can convert the data to CSV.
The website below can be useful for converting an HTML table to CSV:
https://www.convertcsv.com/html-table-to-csv.htm

State,Annual Salary,Monthly Pay,Weekly Pay,Hourly Wage
New York,"$133,172","$11,097","$2,561",$64.03
Idaho,"$129,201","$10,766","$2,484",$62.12
California,"$127,575","$10,631","$2,453",$61.33
New Hampshire,"$124,679","$10,389","$2,397",$59.94
Vermont,"$121,280","$10,106","$2,332",$58.31
Maine,"$120,390","$10,032","$2,315",$57.88
Massachusetts,"$118,391","$9,865","$2,276",$56.92
Hawaii,"$117,548","$9,795","$2,260",$56.51
Tennessee,"$116,638","$9,719","$2,243",$56.08
Nevada,"$116,564","$9,713","$2,241",$56.04
Wyoming,"$116,434","$9,702","$2,239",$55.98
Washington,"$116,253","$9,687","$2,235",$55.89
Arizona,"$116,252","$9,687","$2,235",$55.89
Connecticut,"$113,536","$9,461","$2,183",$54.58
Montana,"$113,325","$9,443","$2,179",$54.48
Rhode Island,"$112,851","$9,404","$2,170",$54.26
Indiana,"$112,813","$9,401","$2,169",$54.24
New Jersey,"$112,746","$9,395","$2,168",$54.20
Alaska,"$112,238","$9,353","$2,158",$53.96
Minnesota,"$112,151","$9,345","$2,156",$53.92
West Virginia,"$112,113","$9,342","$2,156",$53.90
Oregon,"$111,346","$9,278","$2,141",$53.53
Maryland,"$109,286","$9,107","$2,101",$52.54
North Dakota,"$109,256","$9,104","$2,101",$52.53
Pennsylvania,"$108,610","$9,050","$2,088",$52.22
Wisconsin,"$106,369","$8,864","$2,045",$51.14
Virginia,"$106,296","$8,858","$2,044",$51.10
Ohio,"$105,665","$8,805","$2,032",$50.80
Iowa,"$104,159","$8,679","$2,003",$50.08
Nebraska,"$103,247","$8,603","$1,985",$49.64
South Dakota,"$103,087","$8,590","$1,982",$49.56
Colorado,"$102,923","$8,576","$1,979",$49.48
Kentucky,"$102,736","$8,561","$1,975",$49.39
Delaware,"$102,461","$8,538","$1,970",$49.26
Utah,"$101,968","$8,497","$1,960",$49.02
Alabama,"$101,620","$8,468","$1,954",$48.86
New Mexico,"$101,284","$8,440","$1,947",$48.69
South Carolina,"$101,195","$8,432","$1,946",$48.65
Kansas,"$99,712","$8,309","$1,917",$47.94
Florida,"$99,276","$8,273","$1,909",$47.73
Arkansas,"$98,776","$8,231","$1,899",$47.49
Oklahoma,"$98,052","$8,171","$1,885",$47.14
Mississippi,"$97,389","$8,115","$1,872",$46.82
Michigan,"$96,564","$8,047","$1,857",$46.43
Missouri,"$94,548","$7,879","$1,818",$45.46
Texas,"$94,428","$7,869","$1,815",$45.40
Georgia,"$93,556","$7,796","$1,799",$44.98
Illinois,"$93,085","$7,757","$1,790",$44.75
Louisiana,"$89,464","$7,455","$1,720",$43.01
North Carolina,"$84,706","$7,058","$1,628",$40.72
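
The R cleaning steps from the first part of this post (remove the commas, remove the dollar sign, convert to numeric) can be mirrored in Python once the CSV exists. This is a minimal sketch on the first two rows of the CSV above. The helper `to_number` and the dictionary `salaries` are names I made up for the example.

```python
import csv
import io

# First rows of the CSV shown above, kept inline so the example is self-contained.
csv_text = '''State,Annual Salary,Monthly Pay,Weekly Pay,Hourly Wage
New York,"$133,172","$11,097","$2,561",$64.03
Idaho,"$129,201","$10,766","$2,484",$62.12
'''


def to_number(value):
    """Strip the dollar sign and thousands separators, then convert to float."""
    return float(value.replace("$", "").replace(",", ""))


# Map each state to its annual salary as a number.
salaries = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    salaries[row["State"]] = to_number(row["Annual Salary"])

print(salaries)
```

Reading from a saved file instead would only require replacing `io.StringIO(csv_text)` with an opened file object.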