Identifying semantic characteristics of user interaction datasets through application of a data analysis

Bibliographic details

Title
Identifying semantic characteristics of user interaction datasets through application of a data analysis
Authors
Ricardo César Gonçalves Sant’Ana; Pedro Henrique Santos Bisi; Fernando de Assis Rodrigues
Year of publication
2020
Media type
Preprint
Data source
LISSA
sid-179-col-lissa

Access

This resource is freely available.

Abstract
The goal of the study is to identify semantic characteristics of datasets at the moment of data collection, based on the dataset structures found in the data export interfaces of user interaction analysis tools, Internet communication channels, and statistical data access tools involved in the management of a scientific journal, through the application of data analysis and data modeling techniques. The research universe was delimited to exportable dataset structures found in journal publishing systems, online social network statistics, search engines, and web analytics tools. The sample analyzed was restricted to dataset structures available in reports from Open Journal Systems (OJS), Google Analytics, Google Search Console, Twitter Analytics, and Facebook Insights. These resources did not present any version numbering, except for OJS (2.6). The data was collected in September 2017 from the accounts of the "Electronic Journal Digital Skills for Family Farming". An exploratory analysis methodology was adopted to identify how data are made available and structured in those data resources, through a systematic description of the datasets, entities, and attributes related to the interaction between users and the communication channels of a scientific journal. A total of 255 exportable datasets were found, distributed across 5 file formats: Comma-Separated Values (CSV) (82), Google Docs Spreadsheet File Format (69), Excel Microsoft Office Open XML Format Spreadsheet file (50), Portable Document Format (50), and Excel Binary File Format (3). All file formats other than CSV were discarded, mainly because CSV is a machine-readable, open file format that is available in every data export interface analyzed. In total, 82 CSV datasets were collected from Google Analytics (50), Google Search (20), Open Journal Systems (7), Facebook Insights (3), and Twitter Analytics (2).
To systematize the analysis, concepts from the Entity-Relationship (ER) Model (Silberschatz, Korth, & Sudarshan, 2010) were applied, with entities to store the data collected from i) services, ii) resources available in the services, iii) datasets available in the resources, and iv) attributes available in the datasets. Two auxiliary tables were also developed: i) format, to store the file format types available for the datasets, and ii) data type, to store data types: "a named (and in practice finite) set of values" (Date, 2016, p. 228). This applied ER Model provides a structure to store data about the entities and attributes of each dataset. Applying this ER structure to the data collected in this study made it possible to identify 82 entities and 2280 attributes, with a subset of 1342 unique attribute labels. The ER structure and data were stored in a Google Spreadsheet file. The file was then uploaded to a Database Management System (DBMS) for further data analysis. A Python script was developed to reorder the data stored in the DBMS into a new data structure, adopting the Online Analytical Processing (OLAP) cube as representation, with Service (s), Entity (e), and Attribute (a) data used as dimensions (Gray, Bosworth, Layman, & Pirahesh, 1996; Inmon, 1996; Kimball & Ross, 2011). The collected data was reordered into the OLAP cube dimensions by a pivot table process (Cornell, 2005). The intent was to observe, at the intersections of the OLAP cube, the characteristics shared internally and externally by services, entities, and attributes that can affect semantic aspects of data collection. The results show that 88.69% of the attributes are not accompanied by any description of their content. In addition, all attributes that share identical labels across distinct services came without a description at collection time.
This subset of attributes had significant importance for the interoperability of those datasets: it can distinguish the context of the collection process and can also form part of a group of potential primary keys or unique fields, helping to build relationships between data from these sources, or even to determine geographic, temporal, or linguistic context.
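
The pivot described in the abstract, from flat attribute records into Service, Entity, and Attribute dimensions, can be illustrated with a minimal Python sketch. It is not the authors' script: the sample rows, the function names, and the nested-dictionary representation of the cube are all assumptions made for illustration, and the rows are invented rather than taken from the study's data.

```python
from collections import defaultdict

# Hypothetical sample rows: (service, entity, attribute label, description or None).
# Illustrative values only; these are NOT records from the study.
ROWS = [
    ("Google Analytics", "Audience Overview", "Date", None),
    ("Google Analytics", "Audience Overview", "Sessions", "Total number of sessions"),
    ("Twitter Analytics", "Tweet Activity", "Date", None),
    ("Twitter Analytics", "Tweet Activity", "Impressions", None),
    ("Facebook Insights", "Page Data", "Date", None),
]


def build_cube(rows):
    """Pivot flat rows into service -> entity -> {attribute label: description}."""
    cube = defaultdict(lambda: defaultdict(dict))
    for service, entity, label, description in rows:
        cube[service][entity][label] = description
    return cube


def undescribed_share(rows):
    """Return the percentage of attribute rows that carry no description."""
    missing = sum(1 for *_, description in rows if not description)
    return 100.0 * missing / len(rows)


def labels_shared_across_services(rows):
    """Map each label used by more than one service to the sorted service names."""
    services_by_label = defaultdict(set)
    for service, _entity, label, _description in rows:
        services_by_label[label].add(service)
    return {
        label: sorted(services)
        for label, services in services_by_label.items()
        if len(services) > 1
    }


if __name__ == "__main__":
    cube = build_cube(ROWS)
    print(sorted(cube))                         # services present in the cube
    print(undescribed_share(ROWS))              # share of undescribed attributes
    print(labels_shared_across_services(ROWS))  # labels reused across services
```

On these invented rows, the only label shared across services is "Date", and it has no description in any of them, which mirrors the kind of intersection the abstract reports observing.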
Language
English
DOI
10.31229/OSF.IO/U8ATZ