% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/html_df.R
\name{html_df}
\alias{html_df}
\title{Get a tabular summary of webpage content from a vector of urls}
\usage{
html_df(
  urlx,
  max_size = 5e+06,
  wait = 0,
  time_out = 10,
  show_progress = TRUE,
  keep_source = TRUE,
  chrome_bin = NULL
)
}
\arguments{
\item{urlx}{A character vector containing urls.  Local files must be prepended with \code{file://}.}

\item{max_size}{Maximum size in bytes of pages to attempt to parse. Defaults to \code{5e+06}.
This avoids reading very large pages that may cause \code{read_html()} to hang.}

\item{wait}{Time in seconds to wait between successive requests. Defaults to 0.}

\item{time_out}{Time in seconds to wait for \code{httr::GET()} to complete before exiting.  Defaults 
to 10.}

\item{show_progress}{Logical, defaults to \code{TRUE}. Whether to show progress during download.}

\item{keep_source}{Logical, defaults to \code{TRUE}. Whether to retain the contents of the page in the 
\code{source} column of the output tibble.  Set to \code{FALSE} to reduce memory usage when scraping many pages.}

\item{chrome_bin}{(Optional) Path to a Chromium install, used to scrape pages with Chrome in headless mode. Defaults to \code{NULL}.}
}
\value{
A tibble with columns 
\itemize{
\item \code{url} the original vector of urls provided
\item \code{title} the page title, if found
\item \code{lang} inferred page language
\item \code{url2} the fetched url, which may differ from the original, for example if redirected
\item \code{links} a list of tibbles of hyperlinks found in \code{<a>} tags
\item \code{rss} a list of embedded RSS feeds found on the page
\item \code{tables} a list of tables found on the page in descending order of size, coerced to
 \code{tibble} wherever possible.  
\item \code{images} list of tibbles containing image links found on the page
\item \code{social} list of tibbles containing Twitter, LinkedIn and GitHub user info found on the page
\item \code{code_lang} numeric indicating inferred code language.  Negative values near -1 
indicate a high likelihood that the language is Python, positive values near 1 indicate R. 
If no code tags are detected, or the language could not be inferred, the value is \code{NA}.
\item \code{size} the size of the downloaded page in bytes
\item \code{server} the page server
\item \code{accessed} datetime when the page was accessed
\item \code{published} page publication or last updated date, if detected 
\item \code{generator} the page generator, if found
\item \code{status} HTTP status code 
\item \code{source} character strings containing the page source.  Each can be coerced to \code{xml_document}
using \code{xml2::read_html()} for further processing with \code{rvest}.
}
}
\description{
From a vector of urls, \code{html_df()} attempts to fetch the html of each page.  From the 
html, it looks for a page title, rss feeds, images, embedded social media
profile handles and other page metadata.  Page language is inferred using the package \code{cld3},
which wraps Google's Compact Language Detector 3.
}
\examples{
# Examples require an internet connection...
urlx <- c("https://github.com/alastairrushworth/htmldf", 
          "https://alastairrushworth.github.io/")
dl   <- html_df(urlx)
# preview the dataframe
head(dl)
# social tags
dl$social
# page titles
dl$title
# page language
dl$lang
# rss feeds
dl$rss
# inferred code language
dl$code_lang
# print the page source
dl$source
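# The sketches below are illustrative assumptions, not part of the original
# examples.  A more conservative call: pause 1 second between requests and
# do not retain the page source, to reduce memory use.
dl_small <- html_df(urlx, wait = 1, keep_source = FALSE, show_progress = FALSE)
# Re-parse the first page source for further processing.  Assumes the xml2
# package is installed and that the first page was fetched successfully.
src <- dl$source[[1]]
if (requireNamespace("xml2", quietly = TRUE) && !is.na(src)) {
  page <- xml2::read_html(src)
  # extract the text of any top-level headings
  xml2::xml_text(xml2::xml_find_all(page, "//h1"))
}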


}
\author{
Alastair Rushworth
}
