It's been a while.
#Requirements
I've been working on some data collection stuff lately. To locate elements in an HTML page, XPath is very convenient, and it's easy to learn as well.
Simply put, if you regard an HTML page as an element tree, an XPath expression is like a map telling you which way to go at each branch, or at least at the major ones.
We know what the source code of an HTML page looks like, don't we?
To use XPath in Python, the `lxml` module is needed.

```python
import requests
```

Then simply calling `html_page.xpath(xpath_expression)` on the parsed page locates the corresponding elements.
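To make that concrete, here is a minimal sketch, assuming `lxml` is installed. The inline `PAGE_SOURCE` string is made up for illustration and stands in for a downloaded page, and `lxml.html.fromstring` is just one way to build the tree.

```python
from lxml import html

# A tiny inline page stands in for a downloaded one. With requests, the
# equivalent would be: html_page = html.fromstring(requests.get(url).content)
PAGE_SOURCE = """
<html>
  <body>
    <div class="main">
      <p>first paragraph</p>
      <p>second paragraph</p>
    </div>
  </body>
</html>
"""

# Parse the source into an element tree.
html_page = html.fromstring(PAGE_SOURCE)

# xpath() always returns a list: elements for node expressions,
# strings for expressions ending in text().
paragraphs = html_page.xpath('//div[@class="main"]/p')
texts = html_page.xpath('//div[@class="main"]/p/text()')

print(len(paragraphs))
print(texts)
```

Since the result is always a list, remember to index into it (or loop over it) even when you expect a single match.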
#More about xpath
Yeah, locating elements is just like that. However, XPath expressions themselves are a little more complex.
There's a very detailed tutorial on w3schools about XPath syntax. Basically everything you need to get familiar with XPath is in it.
What I wanna write about here is something I learnt from my own experience locating elements.
Locating elements by name AND order at the same time is possible, because predicates can be chained. For example, to locate the text `go` in the snippet below, the XPath expression would be `'//div[@class="main"]/div[5]/p/text()'`, which means the fifth `div` under the main div. To express *the third div under the main div with a class name of "h"*, add a position after the class predicate: `'//div[@class="main"]/div[@class="h"][3]/p/text()'`.

```html
<div class="main">
<div class="h"><p>He</p></div>
<div class="j"><p>and</p></div>
<div class="J"><p>I</p></div>
<div class="h"><p>will</p></div>
<div class="h"><p>go</p></div>
</div>
```

Well, not that hard, you may say. But what if there are hundreds of `div`s under the main div? Counting positions by hand stops being practical, and when you get to work with real HTML pages, it's very rare to find snippets as tidy as the one above. So when a plain position won't do, combine name and order in the predicates, or find another way to locate the element.

Most web browsers have a direct way to copy an XPath. For example, in Microsoft Edge, simply press `F12` to open DevTools. Right-click the corresponding source code and there is a `Copy` -> `Copy XPath` option. However, if you use requests to `get` a page, the source code is A LITTLE different from what's shown in DevTools. You can find loads of information on the internet about why there are differences. Maybe using the POST method would help? I'll give it a try later.
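Going back to the `main` div snippet above, here is a small sketch (assuming `lxml` is installed) that runs both a position-only expression and one that chains a class predicate with a position. The variable names are only for illustration.

```python
from lxml import html

# The same snippet as above, parsed from a string.
SNIPPET = """
<div class="main">
<div class="h"><p>He</p></div>
<div class="j"><p>and</p></div>
<div class="J"><p>I</p></div>
<div class="h"><p>will</p></div>
<div class="h"><p>go</p></div>
</div>
"""

main = html.fromstring(SNIPPET)

# Order only: the fifth div child of the main div.
by_position = main.xpath('//div[@class="main"]/div[5]/p/text()')

# Name AND order: among the div children whose class is "h", take the third.
by_class_and_order = main.xpath(
    '//div[@class="main"]/div[@class="h"][3]/p/text()'
)

print(by_position)
print(by_class_and_order)
```

In `div[@class="h"][3]` the position is counted among the `div` children that already passed the class test, so it picks the third "h" div rather than the third div overall. The class comparison is case-sensitive, so the `J` div is not counted as `j`.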
No tutorial could ever teach you everything about it. You're gonna have to try things when it comes to real problems. Sometimes you need to locate elements one by one, even if they look perfectly aligned in an HTML viewer such as a web browser. Be patient.
With new problems, of course.