WEB SCRAPING IN PYTHON
Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. While a software user can do web scraping manually, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when you view the page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once a page is fetched, extraction can take place. The content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on.
Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else.
Source: Wikipedia
HTML parsing is easy in Python, especially with the help of the BeautifulSoup library. In this post we will scrape a website and extract details from one of its pages.
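To make the split between fetching and extracting concrete, here is a minimal sketch that keeps the two steps in separate functions. The function names `fetch_html` and `extract_headlines` and the `SAMPLE_HTML` string are hypothetical and exist only for illustration. The sketch also assumes Python's built-in `html.parser` rather than `lxml`, so that it runs without extra parser packages.

```python
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Fetching step: download the page and return its HTML as text."""
    # Imported lazily so the extraction step below works without requests.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_headlines(html):
    """Extraction step: return the text of every span with class mw-headline."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="mw-headline")]


# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <h2><span class="mw-headline" id="Maps">Maps</span></h2>
  <h2><span class="mw-headline" id="Westeros">Westeros</span></h2>
</body></html>
"""

print(extract_headlines(SAMPLE_HTML))  # ['Maps', 'Westeros']
```

For a live page, the same extraction would be fed by `extract_headlines(fetch_html(url))`. Keeping the steps apart means the parsing logic can be tried out on saved HTML without touching the network.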
Click here to get the source code from git
from bs4 import BeautifulSoup as webscraper
import requests
import re as regexInstance
# Fetch the page and parse its HTML with the lxml parser
responseInstance = requests.get("https://en.wikipedia.org/wiki/World_of_A_Song_of_Ice_and_Fire")
data = responseInstance.text
parseData = webscraper(data, "lxml")
# Get the type of the parsed data
type(parseData)
# Find the length of the contents
len(parseData.contents)
# Prettify the parsed HTML from the page and print it
print(parseData.prettify())
# Traverse the DOM around the span with id "Maps"
# (find() returns None if the page has no such element)
maps_span = parseData.find("span", {"id": "Maps"})
print(maps_span.parent, "\n")
print(maps_span.contents, "\n")
print(list(maps_span.descendants), "\n")  # descendants is a generator
print(maps_span.next_sibling, "\n")
print(maps_span.previous_sibling, "\n")
# Find all the span elements that have the class "mw-headline" and print their text
for link in parseData.find_all("span", {"class": "mw-headline"}):
    print(link.text)
# Find all the elements with the id "Maps"
print(parseData.find_all(id="Maps"))
# Get the title text and other top-level elements
print(parseData.head)
print(parseData.body.b)
print(parseData.title.text)
print(parseData.title.name)
print(parseData.title.string)  # a string corresponds to a bit of text within a tag
print(parseData.title.parent.name)
print(parseData.p)
print(parseData.a)
print(parseData.get_text())  # extract all the text from the HTML document
# Retrieve the contents of the head tag
headTag = parseData.head
print(headTag.contents)  # display the children of the head tag
# Use a regular expression to retrieve elements: tags whose names start with "b"
for tag in parseData.find_all(regexInstance.compile("^b")):
    print(tag.name)

# Custom filter: match tags that have a class attribute but no id attribute
def classWithNoId(element):
    return element.has_attr('class') and not element.has_attr('id')

print(parseData.find_all(classWithNoId))
# Use select() with CSS selectors to find elements
print(parseData.select("body h1"))
print(parseData.select("html head title"))
print(parseData.select("div:nth-of-type(3)"))
print(parseData.select("body > h1"))
# Find the siblings of tags, and select elements by id
print(parseData.select("#content ~ .mw-body"))
print(parseData.select("#siteNotice"))
print(parseData.select("div#siteNotice"))
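Because the live page can change over time, it may help to try the filter function and the CSS selectors on a small, fixed document first. The sketch below does that. The `SAMPLE_HTML` string and the name `class_with_no_id` are hypothetical, and it assumes the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page used only for illustration.
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <div id="siteNotice" class="notice">Notice</div>
  <h1>Heading</h1>
  <p class="intro">First paragraph</p>
  <p class="note" id="tagged">Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def class_with_no_id(tag):
    """Match tags that carry a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all() calls the function on every tag and keeps those returning True.
matches = soup.find_all(class_with_no_id)
print([tag.name for tag in matches])  # ['p']

# Child combinator: an h1 that is a direct child of body.
print(soup.select_one("body > h1").get_text())  # Heading

# General sibling combinator: p.note elements that follow an h1.
print(soup.select("h1 ~ p.note")[0]["id"])  # tagged
```

Only the first paragraph matches the filter, since the `div` and the second paragraph both carry an `id`. Note that `select()` always returns a list, while `select_one()` returns the first match or `None`.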