Using the Characteristics of Documents, Users and Tasks to Predict the Situational Relevance of Health Web Documents

INTRODUCTION
It is estimated that 3.5 billion individuals (47.3% of the population) were Internet users worldwide in 2016 (World Telecommunication 2016). As the number of users has grown over the past decades, so has the amount of information available to them, including consumer-oriented health information. Consequently, the number of people affected by such information has also increased. Studies have shown that people consider the Internet to be a credible source when seeking health-related information (Savolainen 2008, Kim 2009, Leite et al. 2016). The latest national survey reported that in 2012, among all adults in the U.S., 72% looked online for health information (Fox et al. 2013). Several user studies have been conducted to learn how people use online resources for their health concerns (Fox 2011, Espanha et al. 2008), and how Internet users search for health information on the Web (Fox 2006, Fox et al. 2013). The goal of the current research is to assess and improve relevance estimation of consumer-oriented health information on the Web. Search engines typically estimate relevance using document characteristics (Saracevic 1996), leaving out user and task features that can be useful for relevance estimation. The objective of the present study is to analyze which characteristics influence the relevance of health web documents, with the help of an existing dataset composed of annotated web pages, characteristics, users, tasks and relevance judgments. We aim to find good descriptors and potential predictors of situational relevance. This work is an extension of a previously published study (Oroszlányová et al. 2017).

BACKGROUND

The Concept of Relevance in Information Retrieval
The notion of relevance has been studied for decades. Several information retrieval (IR) models have been developed to predict documents' relevance (e.g., the classical Boolean model, vector space model and probabilistic model). Generally, they consist of a framework including representations of documents, queries, relationships among them and, in some cases, a ranking function. IR models rely on evaluations which consider traditional user and task models. Such models are inadequate, however, because they do not capture, for example, all types of information-seeking tasks, activities, and situations (Kelly et al. 2009). These models do not seem to be sufficient to approximate the relevance judgments (Bates 2011). It is important to consider the associated context, and not just document properties. A range of relevance models have been introduced and discussed, from Saracevic's stratified model of interaction levels (Saracevic 1996) to Borlund's model (Bates 2011). The stratified model is based on theoretical concepts of human-computer interaction (HCI) and on the stratificational theory developed in linguistics. It considers the contemporary reality of IR and the nature of relevance in information science, and it optimizes the strengths and minimizes the weaknesses of both the systems-centered and user-centered approaches to IR (Saracevic 1996). Borlund's model is based on an analytic approach, considering the temporal dimension (Bates 2011). Besides explicit relevance models (Vargas et al. 2012), multidimensional relevance modeling has also been studied (Zhang et al. 2014).
Every search engine has to estimate the usefulness of the information accessed via web pages, referred to as the relevance of a document to a user (Saracevic 1996). As the retrieval of relevant information is the main concern of any IR system (Manning et al. 2009), there are several types of user-based relevance, depending on the context and on the user. In this study, we consider situational relevance (i.e., utility), expressed by the usefulness of the documents to the user task (Saracevic 1996). Estimating situational relevance is worthwhile because knowing how relevance depends on user and document characteristics can bring insights into new features. For this reason, the concept of relevance, that is, the retrieval of relevant information, is central to information retrieval in all domains (Manning et al. 2009). In IR processes, the role of users is also an important factor in relevance assessment. Users evaluate web pages and decide about their utility for different information-seeking tasks, based on certain criteria. These criteria include textual, structural and qualitative aspects, as well as non-textual items and physical properties of the web documents (Tombros et al. 2005). Research, topic, scope, data, influence, affiliation, web characteristics, and authority have been identified as key relevance criteria (Crystal et al. 2006); they indicate the complexity of web users' relevance judgments and are important in the design of IR systems. Other user-defined relevance criteria such as specificity, topicality, familiarity and variety are frequently used in relevance judgments (Savolainen et al. 2006).

Consumer Health Information Seeking
Since the 1990s, when a guide to the Internet was introduced by Pallen (1995), healthcare providers have been sharing information on medical and health topics with the public on the Web. More and more information has become available, and online health seekers have started to look for information not only for themselves, but often also for someone else (Fox et al. 2002, Fox 2006, Fox 2011, Espanha et al. 2008). Thus, health search has an impact on people's health care routines. The reviewed studies of theoretical models of health IR (Marton et al. 2011) suggest the usefulness of multidisciplinary approaches and of conceptual models. A wide range of literature about IR evaluation has been reviewed, providing 'a baseline for the growth and maturation of the specialty' (Marton et al. 2011). This historical overview documents the evolution of IR evaluation methods over 40 years, analyzing 127 selected articles, which readers can use as a baseline bibliography of the area.
Based on the results of the evaluation of user-centered health information retrieval, the development of retrieval techniques for medical queries for lay users proved difficult (Goeuriot et al. 2014b). Related research on automatic generation of queries (Goeuriot et al. 2014a) explores new topic generation strategies, with the aim of generating queries that are representative of patients' information needs. The effectiveness of search engines in retrieving information about medical symptoms has been investigated, with a focus on designing systems that improve health search (Palotti et al. 2015). That investigation concluded that query expansion is an important factor in improving search effectiveness. Further development of search technologies for consumer health search considers self-diagnosis information needs and needs related to treatment and management of health conditions (Zuccon et al. 2016). Relevance assessments were shown to be influenced by user, task, query and document characteristics (e.g., age, gender, health search experience, medical specialty, task clarity) (Lopes et al. 2010). A previous study showed that user and task characteristics are also good descriptors and possible predictors of relevance (Oroszlányová et al. 2015). In the present work, we want to predict the relevance of a document to a user, with the help of the available features (Lopes et al. 2010).

METHODOLOGY

Datasets
The present study is based on an existing dataset composed of an annotated sample of 4533 health web documents. It was initially collected for a user study (Lopes et al. 2013), where the participants performed 8 tasks, associated with different health information seeking situations, based on questions submitted to the health category of the Yahoo! Answers service. From the list of open questions of this category, starting with the most popular one, 8 questions about treatments for a symptom/disease were selected. For each question 4 different search queries were defined, 2 in English and 2 in the participants' native language. In each language, the 2 queries were formulated using lay and medico-scientific terminology, respectively. Queries were built by concatenating each of the 8 symptoms or diseases (painful urination/dysuria; head itching/head pruritus; high uric acid/hyperuricaemia; mouth inflammation/stomatitis; bone infection/osteomyelitis; heartburn/pyrosis; hair loss/alopecia; joint pain/arthralgia), in its lay or medico-scientific form, with the word treatment. To reduce the risk of Google learning from the previously submitted queries, it was ensured that returned links were never clicked. Further, to prevent changes in the search engine, all queries were submitted within a very short time span. For each query, the top-30 results were collected. For these documents, a metadata scheme was defined and used for a later annotation with manual and automatic approaches (Lopes et al. 2011). The documents were assessed by university students in terms of relevance and comprehension, using a 3-valued scale. To evaluate the quality of the annotation, 10% of the documents were also assessed by an external health professional (Sousa 2011). The agreement rate between both assessments was measured through Cohen's kappa, where 38 indicators had concordance values greater than 0.8, 3 indicators had concordance values between 0.6 and 0.8, and 1 indicator had a value between 0.4 and 0.6. Thus, the way the characteristics were evaluated/annotated was, in general, well defined. Information about the users was collected through questionnaires. The metadata scheme that was used to annotate the dataset contains specific characteristics of web documents, tasks and users, listed in Table 1. The document features were categorized according to their content (e.g.: is it readable? is it a scientific publication?), to their web characteristics (e.g.: articles, academic works), to the entity responsible for the website (e.g.: are there contacts of the author and webmaster? is it of scientific nature?), and to the website (e.g.: its objective, domain or type). Task-related characteristics include users' feedback on the tasks' clarity, easiness and familiarity. User characteristics describe the users in terms of their age, English proficiency, health literacy and health search experience.
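The agreement measure mentioned above can be illustrated with a minimal sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the two label lists are hypothetical and are not taken from the dataset; the study does not state which software computed the statistic.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items that received identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical annotations of one binary indicator (1 = present, 0 = absent).
students = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
professional = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(students, professional)
```

In this toy example the raters agree on 9 of 10 items while chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.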
In the present work, situational relevance is assessed by a question where users were asked to evaluate the usefulness of each document on a 3-level scale (0 - non-relevant, 1 - partially relevant, 2 - totally relevant). The task characteristics contain the comprehension of the documents by the users, which has 3 assessment levels, as described in Table 1.

Statistical Analysis
In the section Multivariate Analysis of Situational Relevance, we analyze how multiple variables from our data collection relate to relevance. We build a prediction model with the aim of foreseeing the relevance of a document based on its characteristics, as well as those of users and tasks. With this goal in mind, we first select the variables that build up a model that best fits our data. To do so, we use the least absolute shrinkage and selection operator (lasso), which selects the best subset of predictors by shrinking the regression coefficients towards zero, and estimates the coefficients (James et al. 2013). In our setting it is based on logistic regression, which models the probability of documents' relevance given their characteristics, as well as those of users and tasks. We can write it as Probability(relevance = yes | characteristics), where the probability values p(characteristics) range between 0 and 1. Originally, our model had a multinomial distribution with three relevance levels (0, 1 and 2). Here we merge relevance levels 1 and 2, inducing a binomial distribution of the model.
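The variable selection step can be sketched as follows. This is a minimal illustration under stated assumptions: it fits an L1-penalized (lasso) logistic regression by proximal gradient descent, which is not necessarily the solver used in the study, and it runs on synthetic data rather than on the dataset. The names `fit_lasso_logistic` and `soft_threshold`, the penalty `lam`, the step size and the iteration count are all illustrative choices.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function; maps a score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(value, amount):
    """Shrink value towards zero by amount; this is the lasso proximal step."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=1000):
    """Minimise mean log-loss + lam * ||beta||_1 by proximal gradient descent.

    The intercept is not penalised. Returns (intercept, beta).
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, target in zip(X, y):
            score = intercept + sum(b * x for b, x in zip(beta, row))
            residual = sigmoid(score) - target
            grad0 += residual / n
            for j in range(p):
                grad[j] += residual * row[j] / n
        intercept -= step * grad0
        # Gradient step followed by shrinkage; small coefficients become exactly 0.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta

# Synthetic stand-in data (an assumption): feature 0 drives relevance, feature 1 is noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
graded = [2 if row[0] > 0.8 else 1 if row[0] > -0.3 else 0 for row in X]
# Merge relevance levels 1 and 2 into a single "relevant" class.
y = [1 if level >= 1 else 0 for level in graded]

intercept, beta = fit_lasso_logistic(X, y, lam=0.05)
selected = [j for j, b in enumerate(beta) if b != 0.0]  # indices the lasso keeps
```

Variables whose coefficients are shrunk exactly to zero drop out of `selected`; the remaining ones are the candidates that would be passed on to an unpenalized logistic regression.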
After the lasso variable selection, we include the chosen characteristics in the multiple logistic regression model, and estimate its accuracy using leave-one-out cross-validation (LOOCV). The LOOCV approach repeatedly splits the set of observations into a single observation, used as the validation set, and the remaining observations, which form the training set; the model is fitted on the training set and a prediction is made for the held-out observation. The LOOCV error rate in our classification setting is estimated as the proportion of held-out observations that are misclassified.
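The LOOCV procedure can be sketched in a few lines. To keep the example self-contained, a nearest-class-mean rule on a single feature stands in for the logistic regression model; that classifier, the function names and the toy data are assumptions made only for illustration.

```python
def loocv_error(X, y, fit, predict):
    """Leave-one-out CV: average misclassification over n single-observation folds."""
    n = len(X)
    errors = 0
    for i in range(n):
        # Hold out observation i; train on the remaining n - 1 observations.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        errors += int(predict(model, X[i]) != y[i])
    return errors / n

def fit_centroids(X, y):
    """Stand-in classifier (an assumption): per-class mean of a single feature."""
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return means

def predict_centroid(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

# Toy data: one feature and binary relevance labels (illustrative values only).
X = [0.1, 0.3, 0.4, 0.6, 1.4, 1.6, 1.8, 2.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]
error_rate = loocv_error(X, y, fit_centroids, predict_centroid)  # 2 of 8 folds fail: 0.25
```

Because each fold refits the model, the cost grows with the number of observations, which is the usual trade-off of LOOCV against k-fold cross-validation.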

Continuous
The possible impact of information on the user, e.g., the use of "positive" or "negative" expressions (character of the information) | 0-Negative, 1-Neutral, 2-Positive
Existence of "real" cases given by specialists (clinical cases) | 0-Not present, 1-Present
Whether the content is divided into several pages in case of html formats (split content) | 0-Not present, 1-Present
Language of the content (annotated according to ISO 639-1 (e.

MULTIVARIATE ANALYSIS OF SITUATIONAL RELEVANCE
In this section, we describe how we build the models, the models themselves and their evaluation. We build a second model, called the reduced model, that contains only the significant variables from the full model, and compare the LOOCV estimates of prediction (or test) errors for the two models. We decided to build the reduced model to analyze whether we could reach similar results using a lower number of features, which would ease the process of relevance estimation.

Full Model
Our first model considers all variables. We start our analysis by fitting a lasso model on the training set. Using cross-validation we then choose the "best" tuning parameter, and use it to fit the lasso model on the full dataset (Model definition process). With the variables selected by the lasso model, we fit a multiple logistic regression model (Logistic regression model), and evaluate the results (Evaluation).

Model definition process
Applying the lasso and using the potential predictor variables discussed in the section Datasets, we built a model predicting the relevance of web documents. The lasso, with the minimal tuning parameter chosen by cross-validation, yielded a prediction model containing candidate variables to be analyzed with the multiple logistic regression model.

1 (Never) to 5 (Frequently)
Usage of medico-scientific terminology during Web searches about health subjects | 1 (Never) to 5 (Always)
Level of satisfaction of the users' health information need on web pages, blogs, forums, social networks, chats, newsletter and RSS feeds | 1 (Never) to 5 (Frequently)

Logistic regression model
The lasso helped in variable selection, and we continued the analysis with model selection using logistic regression. The resulting variables from the lasso model were added to the multiple logistic regression model, which is summarized in Table 2. The letters D, U and T in the first column identify the feature as pertaining to the document, user or task, respectively. In the second column, we list the variables. The numbers in parentheses indicate the levels of the variables (according to the scales defined in Table 1). The continuous variables, naturally, do not have such indications, nor do the dichotomous (binary) variables. The latter are the ones scaled with 1 in Table 1. The third column lists the variables' corresponding estimated coefficients. The fourth column contains the standard errors, which assess the accuracy of the coefficient estimates. The fifth column contains the z-statistic, where a large (absolute) value indicates evidence against the null hypothesis of the coefficient being equal to zero. The last column lists the corresponding p-values.
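The relation between the coefficient, standard error, z-statistic and p-value columns can be shown with a short sketch. It assumes the usual Wald test with a standard normal reference distribution; the function name `wald_test` and the example numbers are hypothetical and are not taken from Table 2.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (z, two-sided p-value) for the null hypothesis coefficient = 0."""
    z = coefficient / standard_error
    # Two-sided tail of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical row, not taken from Table 2.
z, p = wald_test(coefficient=-0.50, standard_error=0.20)
significant = p < 0.05  # compared against a significance level of 0.05
```

Here z is -2.5 and the two-sided p-value is roughly 0.012, so such a variable would count as significant at the 0.05 level.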

Evaluation
Our regression model was verified by leave-one-out cross-validation, and its results are reported in the last row of Table 2. The p-values marked in bold in Table 2 are statistically significant at α = 0.05. The negative coefficients indicate that documents with the corresponding variables are less likely to be relevant than the documents without these characteristics, for fixed values of the remaining variables. Variables with large coefficient estimates (e.g., comprehension) are important for relevance. To assess the accuracy of the model, we fitted the model using half of the data (training dataset), and then examined how well it predicts the held-out data (test dataset) (James et al. 2013). Using the test dataset we then computed the probabilities of the document being relevant, allowing us to compute the accuracy, sensitivity and specificity of the model. Given these predictions, we determined how many observations were correctly or incorrectly classified. Our logistic regression has an accuracy of 77.17%, a specificity (true negative rate) of 68.01% and a sensitivity (true positive rate) of 78.98%. The LOOCV estimate of prediction error from Table 2 is low (15.73%), meaning that the regression model is of high accuracy.
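The three rates can be derived from predicted probabilities as in the sketch below. The 0.5 classification threshold is an assumption, since the cut-off used in the study is not stated, and the labels and probabilities are made up for illustration.

```python
def classification_rates(y_true, probabilities, threshold=0.5):
    """Accuracy, sensitivity and specificity for binary labels (1 = relevant)."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, probabilities):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1  # relevant document correctly classified
        elif truth == 1:
            fn += 1  # relevant document missed
        elif predicted == 1:
            fp += 1  # non-relevant document classified as relevant
        else:
            tn += 1  # non-relevant document correctly excluded
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one relevant and one non-relevant document")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Made-up test-set labels and predicted probabilities of relevance.
y_test = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
p_test = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.4, 0.6, 0.1]
rates = classification_rates(y_test, p_test)
```

With these values 4 of 6 relevant and 3 of 4 non-relevant documents are classified correctly, giving an accuracy of 0.7, a sensitivity of about 0.67 and a specificity of 0.75.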

Reduced Model
We built a second model, including only the statistically significant variables from the full model. In this second model, all variables remained significant except the one pertaining to the third level of task familiarity. Table 3 shows the coefficient estimates for a logistic regression model that uses the selected 30 variables to predict the probability of a document being relevant or not relevant for the user. We assessed the model's accuracy using leave-one-out cross-validation, with an estimated prediction error of 0.1585. Our logistic regression has an accuracy of 77.53%, a specificity of 70.85% and a sensitivity of 78.72%. As expected, the LOOCV estimate of prediction error for this model is slightly higher than the one for the full regression model in Table 2.

DISCUSSION
In the section Evaluation and Comparison of the Models, we compare the models in terms of number of variables and evaluation rates. In the section Characteristics' Pertinence for Assessing the Situational Relevance of Health Content, we discuss the most important variables that contribute positively or negatively to the prediction of relevance, and how they can be automatically assessed.

Evaluation and Comparison of the Models
As expected, the best model to predict documents' relevance is the one that contains all variables suggested by the lasso. However, the reduced model was very close in terms of error rates and has the advantage of requiring less information. In Table 4, we summarize the evaluation metrics of the full and reduced logistic regression models. The first row contains the number of variables included in each model. In the second row we can see that the full model has the lowest prediction error estimate (LOOCV error).
The slightly higher value of sensitivity in the full model supports this finding as well. However, its accuracy and specificity, indicated in the third and fourth rows, are slightly lower than those of the reduced model. This implies that the reduced model, with higher accuracy and specificity, is better at excluding the non-relevant documents, which may be preferable in a retrieval system. We note that the unbalanced proportion of relevant documents in the dataset might affect accuracy and yield a very optimistic estimate, which is a common phenomenon in binary classification. Since some of the features were annotated with manual approaches, it might be more difficult to predict them automatically. On the other hand, features annotated with automatic approaches might be easier to predict automatically.
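One common way to discount class imbalance, offered here only as an illustration and not as a measure used in the study, is balanced accuracy: the mean of sensitivity and specificity. The sketch below applies it to the rates reported in the text; the function name is hypothetical.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true positive and true negative rates."""
    return (sensitivity + specificity) / 2.0

# Sensitivity and specificity as reported for the two models.
full_model = balanced_accuracy(0.7898, 0.6801)
reduced_model = balanced_accuracy(0.7872, 0.7085)
```

Under this measure the full model scores about 0.735 and the reduced model about 0.748, which points in the same direction as the comparison of accuracy and specificity above.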

Characteristics' Pertinence for Assessing the Situational Relevance of Health Content
The studied models are useful for understanding which characteristics matter most when estimating the situational relevance of health web documents. The ones that significantly contribute to the prediction of situational relevance, either positively or negatively, might be important for this purpose. For example, search engines might use this information to improve their performance.
The analysis of these models allows us to identify important features to estimate relevance. Documents containing links to other sites were found to be useful for relevance prediction. On the other hand, the variables related to the rank of the document and to documents with content divided into several pages were associated with negative coefficient estimates, indicating that the relation is the other way around. The presence of information about a treatment, and medical terminology understandable by the user, also contribute to the document being relevant. However, the presence of information about clinical cases given by specialists was found to contribute negatively to relevance, as were documents from the domains '.es' and '.br' (i.e., the Internet country code top-level domains for Spain and Brazil). Documents from collaborative websites, from Chilean web domains ('.cl'), containing the name of the webmaster, and recently updated were also shown to be useful for predicting relevance. Users seem to value the use of some media, e.g., Flash documents in SWF format, but not content including audio files. In the case of the SWF format, the Chilean web domain and the search for health information in newsletters, the reason for such findings might be related to the number of such documents.
Besides the above document characteristics, we found that several user health search habits help in estimating the relevance of documents. Users who feel successful in web search, and who frequently conduct health searches in English or Portuguese (which are the languages of the queries in the dataset), were shown to assess documents as relevant more often. Users' proficiency in English was shown to contribute negatively to relevance, as was frequent search on web pages and newsletters. The advanced comprehension level of the documents by the users was shown to highly influence the prediction of their relevance. The clarity of the tasks was also found to contribute positively to relevance, while the familiarity of users with the tasks showed a negative contribution. More experienced users might be more demanding, which is in line with previous findings (Lopes et al. 2010, Saracevic 1996).
Regardless of the high values of their estimates, some of the features included in the model might be less important, being rather a reflection of the dataset (e.g.: there were only a few documents in SWF format (0.14%)). Also, for variables with multiple levels it is useful to consider only one level at a time. For instance, we might prefer the second level to the first one for the variable Comprehension, because its estimate is higher or because we want to make predictions for documents which are completely understood by the users.
These results are aligned with previous findings (Oroszlányová et al. 2017).

CONCLUSIONS
We conducted a multivariate analysis focused on whether, and how, the characteristics of tasks, users and documents are useful to predict document relevance. For this purpose, we built two regression models. The full model, which considered all variables suggested by the lasso, had the lowest LOOCV estimate of prediction error (15.73%) and the highest sensitivity (78.98%). Accuracy was almost equal in the full and reduced models (77.17% vs. 77.53%), and specificity was slightly higher for the reduced model (70.85% vs. 68.01% for the full model). The model with higher accuracy and specificity is best at excluding the non-relevant documents, which may be preferable in some retrieval systems.
Among the features which were identified to predict relevance, we found several characteristics related to the user and tasks. These mainly relate to the users' health search habits, their and the tasks' clearness. Our models consider characteristics that might be difficult to identify automatically (e.g., prevention, prognosis or treatment), others that will be easy to identify automatically (like the ones we already assess automatically) and several for which we envision ways of automatic detection (e.g., copyrights, images, video, type). They can be useful to improve the estimation of relevance by search engines, particularly of health documents on the Web. Therefore, in the future we will work on the development of methods to automatically detect these features. The application of these models to other datasets might also be interesting, allowing the generalization of our results. This might be important because there are features that are only present in a small number of documents, which may be interfering with the model. Another future study might consider incorporating some of the features (e.g., users' understandability of the documents) to improve the performance of search engines.
Last update date (annotated according to ISO 8601 (YYYY-MM) and with "0" if it did not exist) | Nominal
Indication of sources (references) | Continuous
Parallel interests (commercial intent, advertisements) | 0-Not present, 1-Present
Terminology (specific vocabulary) | 1-Little understandable, 2-Understandable, 3-Completely understandable
Type of the content (audio, image, text, video) | 0-Not present, 1-Present
Electronic format of the document (e.g.: html, pdf) | Nominal
Number of pages of the document | Continuous
Documents from a publication of scientific character (e.g.: scientific papers) | 0-Not present, 1-Present
Type of medical information contained in the document (epidemiologic data, pathologic definition, diagnosis, indication of health professionals, place of treatment, prevention, prognosis, treatment) | 0-Not present, 1-Present
Links to other sites/internal pages of the URLs | 0-Not present, 1-Present
Documents - Web Documents
Main type of the content (Article, informative, message, questionnaire, comment, academic work) | Nominal
Rank of the documents chosen by the users 0.g.: .com, .gov, .edu) | Nominal
Type (collaborative, personal, institutional-scientific, institutional-not scientific, electronic commerce) | Nominal
Disclosure (copyrights, privacy policy) | 0-Not present, 1-Present
Editorial review (team of revision, process of revision) | 0-Not present, 1-Present
The user had an exact idea about the information in the tasks | 1 (Disagree) to 5 (Agree)
Level of clarity, easiness and familiarity of the tasks for the users | 1 (Unclear/Easy/Unfamiliar) to 5 (Clear/Complex/Familiar)
Whether the user succeeded in the task (task completion status) | 1 (Unsuccessful) to 5 (Successful)
Whether the users knew the technical terms 0the users | Continuous
Health literacy of the users | Continuous
Number of medical concepts included in the query, that the user knows | Continuous
Age of the users | Continuous
Gender of the users | Nominal
Health status of the users | 1 (Not healthy) to 5 (Very healthy)
Experience of the users with Web search and with health search 0often
Success of the users with Web search and health search | 1 (Never) to 5 (Always)
Health search in Portuguese, English and other language

Table 1. Description of the Documents, Task and User Characteristics

Table 2. Summary of the coefficient estimates in the full model

Table 3. Summary of the coefficient estimates in the reduced multiple logistic regression model

Table 4. Comparison of the full and reduced logistic regression models in terms of number of variables and evaluation rates