There are multiple options available to scrape or extract data from web sites. These utilities can also be used to test or unit test web projects.

Web scraping sites that require JavaScript support

You could certainly write a XUL application with Mozilla (run it with Firefox, XULRunner, etc.) that scripts a web browser. JavaScript is normally used for such tasks.

The tricky part is suppressing all the kinds of dialogue boxes that the browser would otherwise create. You effectively have to override the behaviour of the XPCOM server classes that are invoked for each type of dialogue, and there are a lot of different ones (for example, the one shown if your site redirects to an HTTPS site with an expired certificate).

Of course you should NOT use such a mechanism to violate any site's policy on use by robots. Normally you should never submit a form with a robot.

XUL: an XML-based markup language, scripted with JavaScript, for building the user interface of cross-platform applications such as Firefox.

XULRunner: a runtime environment developed by the Mozilla Foundation to provide a common back-end for XUL-based applications. It replaced the Gecko Runtime Environment, a stalled project with a similar purpose.

Screen Scraping from a web page with a lot of Javascript
Screen scraping through AJAX and javascript
How do I implement a screen scraper in PHP?
What's a good tool to screen-scrape with Javascript support?
Are there command line or library tools for rendering webpages that use JavaScript?
command line URL fetch with JavaScript capability

    $ pip install --user selenium
    $ pip install --user nltk  # no longer supports html cleanup
    $ pip install --user beautifulsoup4
    $ python

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Load the page in a real browser so that its JavaScript runs
    driver = webdriver.Chrome()
    driver.get('https://t.co/lw242oZvUz')
    # time.sleep(5)  # uncomment if the page needs time to finish rendering
    html_source = driver.page_source
    driver.quit()  # close the browser once the source has been captured

    soup = BeautifulSoup(html_source, 'html.parser')

    # page title
    title = soup.title.string
    print(title)

    # raw text, still including script and style contents
    raw_text = soup.get_text()
    print(raw_text.encode('utf-8'))

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on double spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text.encode('utf-8'))
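
The earlier advice about not violating a site's policy on robots can be checked in code before any page is fetched. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent, the example.com URLs, and the helper names `build_parser` and `may_fetch` are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch
# runs without any network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Hypothetical user agent and URLs.
print(may_fetch(parser, "my-scraper", "https://example.com/index.html"))
print(may_fetch(parser, "my-scraper", "https://example.com/private/data.html"))

# Delay in seconds requested by the site, or None if not specified.
print(parser.crawl_delay("my-scraper"))
```

Against a live site, one would probably call `set_url()` with the site's robots.txt address and then `read()` instead of parsing a string, and run this check before handing the URL to `driver.get()`.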