Xpath Notes

It's been a while.

#Requirements

I've been working on some data collection stuff lately. To locate elements in an html page, xpath is very convenient. It's easy to learn as well.

Simply put, if you regard an html page as an element tree, an xpath expression is like a map telling you which way to go at each branch, or at least at the major ones.

We know what the source code of an html page looks like, don't we?

To use xpath in python, the lxml module is needed (the example here also uses requests to fetch the page).

import requests
from lxml import etree

url = "https://example.com"  # placeholder; replace with the page you want to parse
html_page = etree.HTML(requests.get(url).text)

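Then simply calling html_page.xpath(xpath_expression) locates the corresponding elements. The result is a list: elements by default, or strings when the expression ends in text() or an attribute such as @href.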
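
To make the three kinds of result concrete, here is a minimal sketch. The inline page and its class name are made up for illustration, and it parses a string instead of fetching a URL so that it runs offline.

```python
from lxml import etree

# Hypothetical inline page, used instead of a real URL so the sketch runs offline.
html_text = """
<html><body>
  <div class="main">
    <p>first</p>
    <p>second</p>
    <a href="/next">next page</a>
  </div>
</body></html>
"""

html_page = etree.HTML(html_text)

# An expression ending in text() yields one string per matching text node.
paragraphs = html_page.xpath('//div[@class="main"]/p/text()')

# An expression ending in @attr yields the attribute values.
links = html_page.xpath('//div[@class="main"]/a/@href')

# Without text() or @attr, the result is a list of element objects.
elements = html_page.xpath('//div[@class="main"]/p')

print(paragraphs)                    # ['first', 'second']
print(links)                         # ['/next']
print([el.tag for el in elements])   # ['p', 'p']
```

An expression that matches nothing gives back an empty list rather than raising an error, so it's worth checking the length before indexing into the result.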

#More about xpath

Yeah, locating elements is just like that. However, xpath expressions themselves are a little more complex.

There's a very detailed tutorial on w3schools about xpath syntax. Basically everything you need to get familiar with xpath is in it.

What I wanna write about here is what I learnt from my own experience locating elements.

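  1. Locating elements by name AND order at the same time is not obvious at first, but it can be done by chaining predicates. For example, to locate the text go by order alone, the xpath expression would be '//div[@class="main"]/div[5]/p/text()', which means the fifth div under the main div. To express the third div under the main div with a class name of "h", put the class predicate first and the position second: '//div[@class="main"]/div[@class="h"][3]/p/text()'.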

    <div class="main">
    <div class="h"><p>He</p></div>
    <div class="j"><p>and</p></div>
    <div class="J"><p>I</p></div>
    <div class="h"><p>will</p></div>
    <div class="h"><p>go</p></div>
    </div>

    Well, not that hard, you may say. But what if there are hundreds of divs under the main div? When you work with real html pages, it's very rare to find snippets as distinct as the one above. So don't rely on a hand-counted index alone when you need to locate elements by name and order at the same time.

  2. Most web browsers have a direct way to copy an xpath. For example, in Microsoft Edge simply press F12 to open DevTools, right click the corresponding source code, and choose Copy -> Copy XPath. However, if you use requests to get a page, the source code is A LITTLE different from what's shown in DevTools, so a copied xpath may not match. One common reason is that DevTools shows the page after the browser has repaired the markup and run its scripts, while requests only returns the raw response. You can find a load of information on the internet about why there are differences.

    Maybe using the POST method would help? I'll give it a try later.

  3. No tutorial could ever teach you everything about it. You'll have to try things out when it comes to real problems. Sometimes you need to locate elements one by one even if they look perfectly aligned in an html viewer such as a web browser. Be patient.

  4. More to be added as new problems come up, of course.
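
For the snippet in note 1, here is a small sketch of combining name and order in lxml. It assumes nothing beyond the five-div snippet itself. Predicates are applied left to right, so [@class="h"] first narrows the divs down to the ones with class "h", and [3] then picks the third of those.

```python
from lxml import etree

# The snippet from note 1, parsed from a string.
snippet = """
<div class="main">
<div class="h"><p>He</p></div>
<div class="j"><p>and</p></div>
<div class="J"><p>I</p></div>
<div class="h"><p>will</p></div>
<div class="h"><p>go</p></div>
</div>
"""

tree = etree.HTML(snippet)

# By order only: the fifth div child of the main div.
by_position = tree.xpath('//div[@class="main"]/div[5]/p/text()')

# By name and order: the third div child whose class is exactly "h".
by_name_and_order = tree.xpath('//div[@class="main"]/div[@class="h"][3]/p/text()')

# last() avoids counting at all when the wanted element is the final match.
last_h = tree.xpath('//div[@class="main"]/div[@class="h"][last()]/p/text()')

print(by_position)        # ['go']
print(by_name_and_order)  # ['go']
print(last_h)             # ['go']
```

Note that the class comparison is case sensitive, so the div with class "J" is not counted among the "j" or "h" ones.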
