{"id":21161,"date":"2023-02-14T01:16:11","date_gmt":"2023-02-14T01:16:11","guid":{"rendered":"https:\/\/www.booksofall.com\/in\/?post_type=product&#038;p=21161"},"modified":"2023-02-14T01:16:12","modified_gmt":"2023-02-14T01:16:12","slug":"the-java-web-scraping-handbook","status":"publish","type":"product","link":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/","title":{"rendered":"The Java Web Scraping Handbook"},"content":{"rendered":"<p>Introduction to Web scraping<\/p>\n<p>Web scraping, or crawling, is the act of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want. It can be done manually, but the term generally refers to the automated process of downloading the HTML content of a page, parsing it to extract the data, and saving that data into a database for further analysis or use.<\/p>\n<p>Web fundamentals<\/p>\n<p>The internet is really complex: many underlying technologies and concepts are involved in displaying a simple web page in your browser. I don\u2019t pretend to explain everything, but I will show you the most important things you need to understand to extract data from the web.<\/p>\n<p>HyperText Transfer Protocol<\/p>\n<p>From Wikipedia:<\/p>\n<p>The Hypertext Transfer Protocol (HTTP) is an\u00a0<a title=\"Application layer\" href=\"https:\/\/en.wikipedia.org\/wiki\/Application_layer\">application layer<\/a>\u00a0protocol in the\u00a0<a title=\"Internet protocol suite\" href=\"https:\/\/en.wikipedia.org\/wiki\/Internet_protocol_suite\">Internet protocol suite<\/a>\u00a0model for distributed, collaborative,\u00a0<a title=\"Hypermedia\" href=\"https:\/\/en.wikipedia.org\/wiki\/Hypermedia\">hypermedia<\/a> information systems. 
HTTP is the foundation of data communication for the\u00a0<a title=\"World Wide Web\" href=\"https:\/\/en.wikipedia.org\/wiki\/World_Wide_Web\">World Wide Web<\/a>, where\u00a0<a title=\"Hypertext\" href=\"https:\/\/en.wikipedia.org\/wiki\/Hypertext\">hypertext<\/a>\u00a0documents include\u00a0<a title=\"Hyperlink\" href=\"https:\/\/en.wikipedia.org\/wiki\/Hyperlink\">hyperlinks<\/a>\u00a0to other resources that the user can easily access, for example by a\u00a0<a title=\"Computer mouse\" href=\"https:\/\/en.wikipedia.org\/wiki\/Computer_mouse\">mouse<\/a>\u00a0click or by tapping the screen in a web browser. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text. HTTP is the protocol to exchange or transfer hypertext.<\/p>\n<p>Like many network protocols, HTTP uses a client\/server model: an HTTP client (a browser, your Java program, curl, wget\u2026) opens a connection and sends a message (\u201cI want to see that page: \/product\u201d) to an HTTP server (Nginx, Apache\u2026). The server then answers with a response (the HTML code, for example) and closes the connection. HTTP is called a stateless protocol because each transaction (request\/response) is independent. 
FTP, by contrast, is stateful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><iframe frameborder=\"0\" allowtransparency=\"true\" allowFullscreen=\"true\" style=\"width: 100%; height: 700px; border: none;\" src=\"https:\/\/online.visual-paradigm.com\/share\/book\/webscrapinghandbook-199hhestfu?enforceShowPromotionBar=true&#038;p=1\"><\/iframe><\/p>\n","protected":false},"featured_media":21165,"template":"","meta":{"_yoast_wpseo_title":"","_yoast_wpseo_metadesc":""},"product_brand":[],"product_cat":[276],"product_tag":[],"class_list":{"0":"post-21161","1":"product","2":"type-product","3":"status-publish","4":"has-post-thumbnail","6":"product_cat-java","8":"first","9":"instock","10":"shipping-taxable","11":"product-type-simple"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>The Java Web Scraping Handbook - BooksOffAll Indian<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/\" \/>\n<meta property=\"og:locale\" content=\"hi_IN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The Java Web Scraping Handbook - BooksOffAll Indian\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/\" \/>\n<meta property=\"og:site_name\" content=\"BooksOffAll Indian\" \/>\n<meta property=\"article:modified_time\" content=\"2023-02-14T01:16:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png\" 
\/>\n<meta name=\"twitter:label1\" content=\"\u0905\u0928\u0941\u092e\u093e\u0928\u093f\u0924 \u092a\u0922\u093c\u0928\u0947 \u0915\u093e \u0938\u092e\u092f\" \/>\n\t<meta name=\"twitter:data1\" content=\"1 \u092e\u093f\u0928\u091f\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/\",\"url\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/\",\"name\":\"The Java Web Scraping Handbook - BooksOffAll Indian\",\"isPartOf\":{\"@id\":\"https:\/\/www.booksofall.com\/in\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png\",\"datePublished\":\"2023-02-14T01:16:11+00:00\",\"dateModified\":\"2023-02-14T01:16:12+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#breadcrumb\"},\"inLanguage\":\"hi-IN\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"hi-IN\",\"@id\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#primaryimage\",\"url\":\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png\",\"contentUrl\":\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png\",\"width\":\"585\",\"height\":\"781\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/
\/www.booksofall.com\/in\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Categories\",\"item\":\"https:\/\/www.booksofall.com\/in\/categories\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"The Java Web Scraping Handbook\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.booksofall.com\/in\/#website\",\"url\":\"https:\/\/www.booksofall.com\/in\/\",\"name\":\"BooksOffAll Indian\",\"description\":\"Biggest IT eBooks library and learning resources - Free eBooks for programming, computing, artificial intelligence and more.\",\"publisher\":{\"@id\":\"https:\/\/www.booksofall.com\/in\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.booksofall.com\/in\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"hi-IN\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.booksofall.com\/in\/#organization\",\"name\":\"BooksOffAll Indian\",\"url\":\"https:\/\/www.booksofall.com\/in\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"hi-IN\",\"@id\":\"https:\/\/www.booksofall.com\/in\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2022\/06\/booksofall-logo-2.png\",\"contentUrl\":\"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2022\/06\/booksofall-logo-2.png\",\"width\":166,\"height\":30,\"caption\":\"BooksOffAll Indian\"},\"image\":{\"@id\":\"https:\/\/www.booksofall.com\/in\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"The Java Web Scraping Handbook - BooksOffAll Indian","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/","og_locale":"hi_IN","og_type":"article","og_title":"The Java Web Scraping Handbook - BooksOffAll Indian","og_url":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/","og_site_name":"BooksOffAll Indian","article_modified_time":"2023-02-14T01:16:12+00:00","og_image":[{"url":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png","type":"","width":"","height":""}],"twitter_card":"summary_large_image","twitter_image":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png","twitter_misc":{"\u0905\u0928\u0941\u092e\u093e\u0928\u093f\u0924 \u092a\u0922\u093c\u0928\u0947 \u0915\u093e \u0938\u092e\u092f":"1 \u092e\u093f\u0928\u091f"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/","url":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/","name":"The Java Web Scraping Handbook - BooksOffAll 
Indian","isPartOf":{"@id":"https:\/\/www.booksofall.com\/in\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#primaryimage"},"image":{"@id":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#primaryimage"},"thumbnailUrl":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png","datePublished":"2023-02-14T01:16:11+00:00","dateModified":"2023-02-14T01:16:12+00:00","breadcrumb":{"@id":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#breadcrumb"},"inLanguage":"hi-IN","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/"]}]},{"@type":"ImageObject","inLanguage":"hi-IN","@id":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#primaryimage","url":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png","contentUrl":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2023\/02\/img_63eadf02b962a.png","width":"585","height":"781"},{"@type":"BreadcrumbList","@id":"https:\/\/www.booksofall.com\/in\/the-java-web-scraping-handbook\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.booksofall.com\/in\/"},{"@type":"ListItem","position":2,"name":"Categories","item":"https:\/\/www.booksofall.com\/in\/categories\/"},{"@type":"ListItem","position":3,"name":"The Java Web Scraping Handbook"}]},{"@type":"WebSite","@id":"https:\/\/www.booksofall.com\/in\/#website","url":"https:\/\/www.booksofall.com\/in\/","name":"BooksOffAll Indian","description":"Biggest IT eBooks library and learning resources - Free eBooks for programming, computing, artificial intelligence and 
more.","publisher":{"@id":"https:\/\/www.booksofall.com\/in\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.booksofall.com\/in\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"hi-IN"},{"@type":"Organization","@id":"https:\/\/www.booksofall.com\/in\/#organization","name":"BooksOffAll Indian","url":"https:\/\/www.booksofall.com\/in\/","logo":{"@type":"ImageObject","inLanguage":"hi-IN","@id":"https:\/\/www.booksofall.com\/in\/#\/schema\/logo\/image\/","url":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2022\/06\/booksofall-logo-2.png","contentUrl":"https:\/\/www.booksofall.com\/in\/wp-content\/uploads\/sites\/13\/2022\/06\/booksofall-logo-2.png","width":166,"height":30,"caption":"BooksOffAll Indian"},"image":{"@id":"https:\/\/www.booksofall.com\/in\/#\/schema\/logo\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/product\/21161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/product"}],"about":[{"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/types\/product"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/media\/21165"}],"wp:attachment":[{"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/media?parent=21161"}],"wp:term":[{"taxonomy":"product_brand","embeddable":true,"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/product_brand?post=21161"},{"taxonomy":"product_cat","embeddable":true,"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/product_cat?post=21161"},{"taxonomy":"product_tag","embeddable":true,"href":"https:\/\/www.booksofall.com\/in\/wp-json\/wp\/v2\/product_tag?post=21161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}