Need to clean web scraped data using python
I am trying to write code that scrapes data from http://goldpricez.com/gold/history/lkr/years-3. The code I wrote is below, and it runs and returns results.
import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)
print(df)
But the result includes some unwanted data, and I only want the data in the table. Please help me solve this.
Here I have added an image of the output with the unwanted data circled in red.
The way you are calling .read_html returns a list of all the tables on the page. Your table is at index 3:
import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)  # this will give you a list of dataframes from the html
print(df[3])

Or index the list directly:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)[3]
print(df)
.read_html fetches the URL and parses the response with BeautifulSoup behind the scenes. Much as with .read_csv, you can change the parser, match a specific table, or pass headers; see the .read_html documentation for more details.
For speed you can use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. By default html5lib, the second slowest, is used; the other flavor, html.parser, is the slowest of them.
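The list-of-tables behavior is easy to see offline. Below is a minimal sketch using an inline HTML snippet; the table contents, column names, and index are made up for illustration, not taken from goldpricez.com:

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for a real page: two <table> elements,
# only the second of which holds the data we want.
html = """
<table><tr><td>sidebar junk</td></tr></table>
<table>
  <tr><th>Date</th><th>Price (LKR)</th></tr>
  <tr><td>2019-01-01</td><td>52000</td></tr>
  <tr><td>2019-01-02</td><td>52100</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
print(len(tables))  # 2
print(tables[1])    # only the table we care about
```

On the real page the same indexing applies; you just have to find which position your table occupies in the returned list.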
Use BeautifulSoup for this; the code below works perfectly:
import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")
data = s.find_all("td")
data = data[11:]  # skip the leading <td> cells that come before the price table
for i in range(0, len(data), 2):
    print(data[i].text.strip(), " ", data[i + 1].text.strip())
Another advantage of using BeautifulSoup is that it is faster than your code.
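If the goal is to clean the scraped pairs rather than just print them, one option is to load them into a DataFrame. A minimal sketch of that step, using an inline HTML snippet in place of the live response (the snippet and the column names are illustrative assumptions, not the real page structure):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Offline stand-in for r.text; the real code would use requests.get(url).text.
html = """
<table>
  <tr><td>2019-01-01</td><td>52000</td></tr>
  <tr><td>2019-01-02</td><td>52100</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]

# Pair consecutive cells into (date, price) rows, then load into a DataFrame.
rows = list(zip(cells[::2], cells[1::2]))
df = pd.DataFrame(rows, columns=["Date", "Price (LKR)"])
print(df)
```

From there the usual pandas tools (type conversion, filtering, dropping rows) take over the cleaning.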