Identifying User Profiles from Statistical Grouping Methods

This research aimed to group users into subgroups according to their level of knowledge about technology. Hierarchical and non-hierarchical statistical clustering methods were studied, compared and used to create the subgroups from similarities in the users' skill levels with technology. The research sample consisted of teachers who answered online questionnaires about their skills in using educational software and hardware. The clustering methods were applied and showed the possible groupings of the users. Analysis of these groups made it possible to identify the characteristics common to the individuals in each subgroup. It was therefore possible to define two subgroups of users, one with skill in technology and another with little skill. The partial results show that the two main clustering algorithms reached 92% similarity in forming these two groups, confirming the accuracy of the discrimination techniques.


INTRODUCTION
Data collection is an important step in scientific research. It requires selecting a representative sample, choosing appropriate methods and techniques, and organizing the data for later analysis.
Sampling is one of the first steps of data collection to be defined, in order to avoid biased results and to ensure that the surveyed populations are represented, especially when quotas or percentages must be guaranteed for subgroups within the sample. The objective of this investigation was to identify statistical methods able to group the surveyed individuals into subgroups with similar skills with technology.
The sample consisted of teachers who responded to a semi-structured online survey (Flick, 2013) about their experience with technological tools and related aspects. The data were organized into tables and submitted to cluster analysis algorithms (Mingoti, 2005) to help define one group of users with ability in technology and another without.
Statistical methods were thus applied to group the individuals in the sample according to their survey responses, which made it possible to form subgroups according to their characteristics.
The results show 92% agreement between the subgroups formed by the Ward and K-means clustering algorithms. The methods used here may also support future investigations that seek to classify users and define subgroups according to their characteristics.

THE USER PROFILE
An important step in the development of any product is to meet the wishes and needs of its users. This also applies to information and communication technology (ICT), and the area of human-computer interaction (HCI) is dedicated to studying and developing more efficient methods and techniques for capturing such data about users.
The user profile is an individualized description of the characteristics of users (Barbosa and Silva, 2010). Baxter, Courage and Caine (2015) agree with this concept and add that the purpose of building a user profile is to let those developing the product know their users better, and to guide the choice of participants for research, validation, satisfaction studies and other activities. Rogers, Sharp and Preece (2013) state that user characteristics should cover the main attributes of the intended user group and highlight the users' relevant skills and abilities. They cite some attributes to consider when surveying a profile: nationality, education, preferences, personal circumstances, physical or mental disabilities, among others.
Courage and Baxter (2005), Hackos and Redish (1998) and Peffer and Renken (2015) describe types of data that can be collected to clarify the user profile and to better define the product domain and the user's interaction with the technology: demographics, experience in the current position, company information, level of education, experience with computers, experience with the specific product or similar tools, available technology, training, attitudes and values, domain knowledge, goals, tasks, severity of errors, motivation to work, languages and jargon.
It is important to identify the level of experience of the users under investigation (beginner, expert, casual user or frequent user), as it particularly affects the forms of interaction to be designed. This shows the importance of defining subgroups of the sample through statistical analysis of the data provided by users, so that products and their validation mechanisms are implemented in accordance with the intended target audience (Rogers et al., 2013).
Courage and Baxter (2005) and Hackos and Redish (1998) support this point of view, stating that the user profile helps to identify for whom the product is being built and helps in choosing participants for future analysis and product evaluation activities.
User data can be collected through interviews and questionnaires. These data are then assigned to the groups and ranges into which they fit, in order to draw up profiles of users with similar characteristics and to establish the proportion of users that fit each profile. It is possible to prioritize certain features of a user profile, depending on the product or project in question (Barbosa and Silva, 2010).
In the interface design process, users should be identified and characterised through user analysis and modeling covering the following aspects, according to Oliveira Netto (2010): the role or function specific to each user, familiarity with computers, level of knowledge of the application domain, frequency of use of the application and sociocultural context. Lee (1993) suggests that user analysis can be divided into five steps:
• Identify the critical and central factors for implementation;
• Explore other critical factors for implementation;
• Estimate the distribution of users for each factor;
• Identify the major groups of users;
• Analyze the collective involvement of the distribution of users.
Subjective factors may also be included in the last step of this classification, since the distribution itself is not a factor. Oliveira Netto (2010) recommends that questionnaires for user analysis take into account the user's preferred graphical environment (Windows, Linux or Mac OS), frequency of use (occasional, frequent or a stated number of times per period) and level of familiarity or expertise in the application domain (beginner, intermediate or expert), that is, knowledge of how to perform the same tasks without the aid of the computer.
Thus, this research prioritizes user experience, while the questionnaire also seeks to identify attitudes of users that can confirm their reported experience.

GROUPING METHODS
This section presents the main statistical grouping methods, in particular the hierarchical and non-hierarchical types. It is natural to wish to classify individuals or elements according to a pattern of similarity. The classification is used to make decisions appropriate to each group, optimizing and directing consistent actions according to the needs of the elements that make up each cluster. The assignment of elements to groups can be done subjectively, and is then subject to interference from the person responsible for discriminating between the elements. With statistical methods, however, the classification is done impartially, free of human intervention.
A difficulty inherent in the classification process is the number of variables under study: the more variables, the more laborious the process. For this purpose, there are quantitative methods that take into account the multivariate information of each sampling unit. Cluster analysis divides into two families, hierarchical and non-hierarchical methods; details can be found in Rencher (2002).

Hierarchical Methods
Consider n elements measured in p variables. The formation of groups relies on metrics that quantify the similarity or dissimilarity between observations. The measure most often used in the literature is the Euclidean distance, represented by a square matrix $D_{n \times n}$ containing the pairwise distances between all observations, where n is the number of elements. Each element $d_{ij}$ of $D$ is obtained by
$$d_{ij} = \sqrt{(x_i - x_j)^{\top}(x_i - x_j)} \qquad (1)$$
where $x_i$ and $x_j$ are vectors of dimension p corresponding to elements i and j. The Euclidean distance was employed in this study. This and other measures are presented in Hardle and Simar (2007). The distance is used as the decision criterion in the procedure shown in Algorithm 1, which allocates an element to a group at each iteration through a connection (linkage) method. In this study, the four main methods are used: Single, Complete, Average and Ward. The connection methods are discussed in detail in Johnson and Wichern (2007).
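As a minimal sketch of how the pairwise Euclidean distance matrix can be computed, the Python fragment below builds the symmetric n × n matrix from a list of response vectors. The function names and the three sample rows are illustrative assumptions, not the study's data or the R code the authors used.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(data):
    """Return the symmetric n x n matrix of pairwise Euclidean distances."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric, so each pair is computed once.
            dist[i][j] = dist[j][i] = euclidean_distance(data[i], data[j])
    return dist

# Hypothetical responses of three participants to two items (not study data).
responses = [[0, 0], [3, 4], [6, 8]]
D = distance_matrix(responses)
```

The diagonal is zero and the matrix is symmetric, so only the n(n − 1)/2 distances above the diagonal carry information.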
Algorithm 1: Grouping based on hierarchical method
Input: Data with n measured objects in p variables
Output: Matrix of distances among the objects considering all the variables
1. For i = 1 → n do
2.   [...]
3.   Allocate the selected object to a set through the connection method
4.   Calculate the new (n − i) × (n − i) matrix of distances
5. End
The Single method, also known as nearest-neighbor connection, uses the shortest distance to allocate an element to a group. The Complete method, in contrast to Single, groups by the farthest neighbors and therefore uses the greatest distance between the elements of the groups. The Average method is analogous to the previous ones, except that the distance between groups is taken as the average of the distances between pairs of elements, one from each group. Figure 1 illustrates these three methods. Finally, in the Ward method, an element is allocated to a group so as to minimize a measure of internal homogeneity; that is, at every step the method adds objects so as to make the groups as homogeneous as possible. The measure of homogeneity is based on the sum of squares of an analysis of variance (Bieniasz and Majchrzak, 2011).
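The four connection methods described above can be compared in a short sketch. The Python code below is a naive agglomerative procedure written for illustration only, not the R implementation used in the study. The Ward cost is computed as the increase in within-group sum of squares caused by a merge, which is one common formulation and is assumed here. All function names and the sample scores are hypothetical.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def linkage_cost(data, a, b, method):
    """Dissimilarity between clusters a and b (lists of row indices)."""
    if method == "ward":
        # Increase in within-cluster sum of squares if a and b are merged.
        ca = centroid([data[i] for i in a])
        cb = centroid([data[i] for i in b])
        return len(a) * len(b) / (len(a) + len(b)) * euclidean(ca, cb) ** 2
    pair_distances = [euclidean(data[i], data[j]) for i in a for j in b]
    if method == "single":      # nearest neighbors
        return min(pair_distances)
    if method == "complete":    # farthest neighbors
        return max(pair_distances)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(data, n_clusters, method="ward"):
    """Start with singletons and merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage_cost(data, clusters[pair[0]], clusters[pair[1]], method),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Hypothetical scores (not study data): rows 0-2 are low, rows 3-5 are high.
scores = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
groups = {m: agglomerate(scores, 2, m) for m in ("single", "complete", "average", "ward")}
```

On well-separated data such as this toy example all four methods recover the same two groups. They tend to differ when groups overlap or contain outliers, which is the situation where comparing dendrograms becomes informative.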

Non-Hierarchical Methods (K-means)
The method is considered an unsupervised learning algorithm in which elements are grouped into clusters according to their similarity (Ribas et al., 2012). The method was proposed by Hartigan and Wong (1979) and is still considered one of the most robust clustering algorithms, despite its sensitivity to the presence of outliers. For Everitt and Hothorn (2011), this sensitivity stems from the matrix of distances used to obtain the centroids. The number of groups to be formed is indicated by the letter k. In this study, k refers to the levels of the users' ability to operate the software.
The k-means algorithm positions the clusters in the same space as the data, using the Euclidean distance as the measure of similarity. The position of each group is given by its centroid, defined in each dimension of the space as the sum of the coordinates of the cluster elements divided by the number of cluster elements. To add an element to a cluster, the group with the shortest distance to the element is selected. The pseudocode runs through all objects and all groups, reallocating objects between groups when needed (Algorithm 2).
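The procedure described in this paragraph can be sketched as a simple assign-and-update loop. The Python code below is a Lloyd-style iteration written for illustration; it is not the Hartigan and Wong variant cited above and not the R code used in the study. Starting from k randomly chosen observations as centroids is an assumption about how "random" initial groups are created, and the function names, the `seed` parameter and the sample scores are hypothetical.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    """Centroid: per-dimension sum divided by the number of cluster elements."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def k_means(data, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until the groups stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct observations drawn at random.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every element to the group with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(x, centroids[c])) for x in data
        ]
        if new_labels == labels:  # no change in the groups: converged
            break
        labels = new_labels
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical scores (not study data): rows 0-2 are low, rows 3-5 are high.
scores = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels, centroids = k_means(scores, k=2)
```

The numeric label given to each group depends on the random start, so only the partition itself (which elements share a label) should be interpreted.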
Algorithm 2: Grouping based on k-means method
Input: Data with n measured objects in p variables
Input: Number of groups k
Output: Matrix of distances among the objects considering all the variables
1. Create k groups randomly
2. While there is change in the groups do
3.   For i = 1 → n do
4.     [...]
5.     Use the calculated centroid to sort the objects
6.   End
7. End

METHODOLOGY
This research is qualitative-quantitative (Appolinario, 2016), because it analyzes the characteristics of the participants in the sample in order to classify them into subgroups according to their technological abilities. For this, we use quantitative data generated from the answers to the survey applied to the respondents.
In addition, it has characteristics of exploratory research, according to the classification of research by objectives (Gil, 2002), because it seeks to understand the problem better and make it better known in the scientific community, including explaining the solution to the research problem.
As for the technical procedures (Gil, 2002), this research initially uses bibliographic and survey research. Bibliographic research is necessary for a better understanding of the theoretical aspects, while survey research is necessary to collect the sample data to be analyzed by the chosen statistical methods, in order to group the participants according to their skills with technology.
The research then sought to determine teachers' skills in the use of technology. The sample consisted of 71 teachers from educational institutions in the Brazilian states of Amazonas, Alagoas, Bahia, Ceará and Pernambuco and followed the list-based random sample survey model of Von Baur and Florian (2009). E-mail lists of members of educational institutions were therefore used to invite members to respond to the online questionnaire about their ICT skills, attitudes and frequency of use.
Online survey research is a trend because of the following advantages: low cost, speed of delivery to the target public, ease of use, absence of spatial constraints and good response rate (Flick, 2013).
The data collection instrument was an online questionnaire with objective items about the level of knowledge and frequency of use of technologies and tools used as a didactic resource or as learning support, adapted from the teacher profile surveys of Freire et al. (2011), Martins et al. (2010) and Oliveira et al. (2016) (Table 1).
Frequencies of use followed a Likert scale (Marconi and Lakatos, 2002) with the following options: daily, weekly, monthly, semi-annual, annual, never used or unknown meaning. Participants received the form link by email, sent through institutional mailing lists. The first part of the survey consisted of a free and informed consent form that presented the objectives of the research, the researchers involved and the participants' rights, mainly the guarantee that the data would be used exclusively for research purposes and with maximum confidentiality of the participants' identities. The participant could then accept or decline to participate by answering the corresponding survey item. The survey remained online for about sixty days.
After this period the data were tabulated and formatted in a spreadsheet editor and then processed in the R software, using the clustering algorithms to define and create the subgroups. For this, the skmeans, pvclust and cluster packages were used to construct the graphs and execute the algorithms.
First, the profiles are identified in terms of the sample's abilities in using technology, by means of Cluster Analysis, also known as Classification Analysis. The main objective here is to group the elements of the sample according to the characteristics reported in the survey on the use of technology, based on the similarities among them, because the sample as a whole shows heterogeneous characteristics among its individuals (Mingoti, 2005).
Table 1. Survey with Ability in Technology and Education Support Tools (Freire et al., 2011; Martins et al., 2010; Oliveira et al., 2010).
The data are presented in the next section through graphs, specifically dendrograms that bring together the groups of users with ability in technology (AT) and low ability in technology (LAT).

RESULTS AND DISCUSSION
The groups were formed from the similar characteristics of the individuals in the sample, as reported when they answered the profile questionnaire during data collection. The hierarchical method was applied with the Single (Figure 2), Complete (Figure 3), Average (Figure 4) and Ward (Figure 5) algorithms.
Figures 2, 3, 4 and 5 show dendrograms of the hierarchical method, with the sample distributed between a group of users with skill with technology (AT) and a group with little ability with technology (LAT). The four methods are based on the Euclidean distance. Therefore, the more similar the responses of individuals, the smaller the distance between them. The analysis of the dendrograms generated from the survey data leads to the conclusion that the Ward grouping algorithm (Figure 5) presents a more coherent configuration of the individuals in each subgroup. This result was expected, considering the robustness of the method, which homogenizes the groups exhaustively over the iterations.
The algorithm of the non-hierarchical method (Figure 6) allows the two subgroups to be visualized and compared with the Ward result (Figure 5). In both, the grouping is the same. The individuals, measured on the 30 variables, are plotted as functions of linear combinations of those variables. In Figure 6, the principal components technique was used. In Figure 7, the discriminant function was used, providing greater power of discrimination, as the name suggests. For more details, see Johnson and Wichern (2007). In Figures 6 and 7, the plotted points are the elements rewritten as functions of the respective linear combinations after the execution of the algorithm used by the method.
Grouping the survey participants with Ward and K-means shows 92% agreement in the users assigned to the AT and LAT profiles. The 8% divergence between the two techniques is due to 6 individuals that k-means placed in the group of users with low ability, while Ward placed them in the AT profile (Table 2).
Therefore, when analyzing the profiles of the users in each group, items 2, 12, 5, 4, 30, 13, 19, 28, 1 and 11 of Table 1 are the most relevant for assigning users to the AT or LAT group, while items 14, 8, 29, 26, 9 and 6 are the least relevant.
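The reported agreement can be checked with simple arithmetic: with 71 participants and 6 assigned differently, 65 of 71 assignments match, which is about 91.5% and rounds to 92%. The Python sketch below reproduces this check. Only the totals 71 and 6 come from the text; the split of the remaining participants between AT and LAT is a placeholder that does not affect the result, and the function name is hypothetical.

```python
def agreement(labels_a, labels_b):
    """Share of participants given the same profile by both partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same participants")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

n_participants = 71  # sample size reported in the text
n_divergent = 6      # individuals placed in AT by Ward but in LAT by k-means
n_at_both = 35       # placeholder only: the actual AT/LAT split is not given here

# Build two label vectors that differ in exactly the six divergent individuals.
ward = ["AT"] * (n_at_both + n_divergent) + ["LAT"] * (n_participants - n_at_both - n_divergent)
kmeans = ["AT"] * n_at_both + ["LAT"] * (n_participants - n_at_both)

share = agreement(ward, kmeans)
disagreements = sum(a != b for a, b in zip(ward, kmeans))
print(f"{share:.1%} agreement, {disagreements} divergent participants")
```

This direct comparison works because both partitions use the named AT and LAT profiles. With unnamed cluster numbers, the labels of one partition would first have to be matched to those of the other.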

CONCLUSIONS
The use of these clustering methods is a convenient resource for research whose purpose is to segregate elements into subgroups impartially, free of subjective intervention. The grouping is performed through algorithms that have the Euclidean distance as execution criterion. Therefore, it is not known which variable has the greatest influence among those used, since the measure concentrates multivariate information into a single value.
The technique is purely descriptive and does not allow hypothesis tests or inferences. However, it is appropriate for classification scenarios in which the response pattern of the individuals belonging to each group is unknown, especially when many variables are used. That is, the profile of an individual with technology ability is not known in advance, so an element cannot simply be associated with a predefined profile.
After the groups are defined, it is possible to identify the similar characteristics of the individuals that make up each one. It then becomes possible to develop training strategies focused on specific characteristics, so that individuals with little knowledge of technology can reach the group with skill with technology.
Thus, the set of procedures in this research can be generalized to other scientific applications in several areas of knowledge, and can also support the representative selection of a sample according to established research needs.