web-scraping – IT Nursery

Web scraping with Python [closed]

June 4, 2022 by IT Nursery

I’d like to grab daily sunrise/sunset … Read more

How to save an image locally using Python whose URL address I already know?

June 2, 2022 by IT Nursery

I know the URL of an image on the Internet, e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the Google logo. How can I download this image using Python without opening the URL in a browser and saving the file manually?
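One way to approach this with only the standard library is sketched below. It is a minimal example, not the accepted answer from the original thread: the function name download_image, the dest parameter, and the 30-second timeout are all illustrative assumptions.

```python
import shutil
import urllib.request
from pathlib import Path


def download_image(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Stream the resource at `url` into the local file `dest`.

    `download_image`, `dest`, and the default timeout are hypothetical
    names and values chosen for this sketch.
    """
    target = Path(dest)
    # urlopen returns a file-like response object; copying it in chunks
    # avoids holding the whole image in memory at once.
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling download_image("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "google-logo.jpg") would write the file into the current directory, assuming the URL from the question is still reachable. The output file name is an arbitrary choice.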

How can I efficiently parse HTML with Java?

May 31, 2022 by IT Nursery

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both the tasks. I want to use a light HTML parser because it takes much time in HtmlUnit to first load a page, … Read more

How can I pass variable into an evaluate function?

May 27, 2022 by IT Nursery

I’m trying to pass a variable into a page.evaluate() function in Puppeteer, but when I use the following very simplified example, the variable evalVar is undefined. I’m new to Puppeteer and can’t find any examples to build on, so I need help passing that variable into the page.evaluate() function so I can use it inside. … Read more

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

May 26, 2022 by IT Nursery

I’m practicing the code from ‘Web Scraping with Python’, and I keep having this certificate problem:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    pages = set()

    def getLinks(pageUrl):
        global pages
        html = urlopen("http://en.wikipedia.org"+pageUrl)
        bsObj = BeautifulSoup(html)
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    #We have …

Read more
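The excerpt does not show how the problem was resolved. CERTIFICATE_VERIFY_FAILED generally means Python could not validate the server's certificate chain against the CA certificates it knows about, so one option is to hand urlopen an explicit SSL context. The sketch below assumes that approach; make_verified_context, fetch, and the cafile parameter are hypothetical names for illustration, and cafile would point at a CA bundle of your choosing.

```python
import ssl
import urllib.request
from typing import Optional


def make_verified_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build an SSL context that verifies certificates and host names.

    Hypothetical helper: with cafile=None the system trust store is used;
    otherwise `cafile` should be the path to a PEM bundle of CA certificates.
    """
    return ssl.create_default_context(cafile=cafile)


def fetch(url: str, context: Optional[ssl.SSLContext] = None) -> bytes:
    """Fetch `url`, checking the server certificate against `context`."""
    ctx = context or make_verified_context()
    with urllib.request.urlopen(url, context=ctx, timeout=30) as response:
        return response.read()
```

This keeps verification switched on and only changes where the trusted certificates come from. Disabling verification would also silence the error, but it removes the protection that HTTPS is meant to provide.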

Web-scraping JavaScript page with Python

May 22, 2022 by IT Nursery

I’m trying to develop a simple web scraper. I want to extract text without the HTML code. It works on plain HTML, but not in some pages where JavaScript code adds text. For example, if some JavaScript code adds some text, I can’t see it, because when I call: response = urllib2.urlopen(request) I get the … Read more

How can I get the Google cache age of any URL or web page? [closed]

May 20, 2022 by IT Nursery

In my project I need the Google cache age to be added as important information. I tried to search … Read more

Headless Browser and scraping – solutions [closed]

May 10, 2022 by IT Nursery

I’m trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of … Read more

How to find elements by class

May 1, 2022 by IT Nursery

I’m having trouble parsing HTML elements with a “class” attribute using Beautiful Soup. The code looks like this:

    soup = BeautifulSoup(sdata)
    mydivs = soup.findAll('div')
    for div in mydivs:
        if (div["class"] == "stylelistrow"):
            print div

I get an error on the same line “after” the script finishes.

    File "./beautifulcoding.py", line 130, in getlanguage
      if (div["class"] == "stylelistrow"):
    File … Read more
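The traceback is cut off, but a lookup like div["class"] can only succeed for elements that actually have a class attribute, and a class value may hold several space-separated names. The sketch below illustrates a defensive version of the same filter using only the standard library's html.parser, so it runs without Beautiful Soup installed. ClassFinder and find_by_class are hypothetical names. In Beautiful Soup 4 itself, soup.find_all("div", class_="stylelistrow") expresses the same filter directly.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect the attributes of every `tag` element carrying `css_class`.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        attr_map = dict(attrs)
        # `class` may be missing or empty, and may list several names,
        # so look it up defensively and compare individual tokens.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            self.matches.append(attr_map)


def find_by_class(html: str, tag: str, css_class: str) -> list:
    """Return the attribute dicts of matching elements in document order."""
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Elements without a class attribute are skipped rather than raising an error, which is the behavior the loop in the question appears to be missing.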

Which HTML Parser is the best?

April 7, 2022 by IT Nursery

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after. Its party trick is a CSS selector syntax to find elements, e.g.:

    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links = …

Read more