RT Dissertation/Thesis T1 Towards a human-centric data economy A1 Andrés Azcoitia, Santiago AB Spurred by widespread adoption of artificial intelligence and machine learning, “data” is becoming a key production factor, comparable in importance to capital, land, or labour in an increasingly digital economy. In spite of an ever-growing demand for third-party data in the B2B market, firms are generally reluctant to share their information. This is due to the unique characteristics of “data” as an economic good (a freely replicable, non-depletable asset holding a highly combinatorial and context-specific value), which moves digital companies to hoard and protect their “valuable” data assets, and to integrate across the whole value chain seeking to monopolise the provision of innovative services built upon them. As a result, most of those valuable assets remain unexploited in corporate silos today. This situation is shaping the so-called data economy around a number of champions, and it is hampering the benefits of a global data exchange on a large scale. Some analysts have estimated the potential value of the data economy at US$2.5 trillion globally by 2025. Not surprisingly, unlocking the value of data has become a central policy of the European Union, which also estimated the size of the data economy at €827 billion for the EU27 in the same period. Within the scope of the European Data Strategy, the European Commission is also steering initiatives aimed at identifying relevant cross-industry use cases involving different verticals, and at enabling sovereign data exchanges to realise them. Among individuals, the massive collection and exploitation of personal data by digital firms in exchange for services, often with little or no consent, has raised a general concern about privacy and data protection.
Apart from spurring recent legislative developments in this direction, this concern has prompted some voices to warn against the unsustainability of the existing digital economy (few digital champions, potential negative impact on employment, growing inequality), some of which propose that people be paid for their data in a sort of worldwide data labour market as a potential solution to this dilemma [114, 115, 155]. From a technical perspective, we are far from having the required technology and algorithms that will enable such a human-centric data economy. Even its scope is still blurry, and the question of the value of data is, at the very least, controversial. Research works from different disciplines have studied the data value chain, different approaches to the value of data, how to price data assets, and novel data marketplace designs. At the same time, complex legal and ethical issues with respect to the data economy have arisen around privacy, data protection, and ethical AI practices. In this dissertation, we start by exploring the data value chain and how entities trade data assets over the Internet. We carry out what is, to the best of our understanding, the most thorough survey of commercial data marketplaces. In this work, we have catalogued and characterised ten different business models, including those of personal information management systems, companies born in the wake of recent data protection regulations and aiming at empowering end users to take control of their data. We have also identified the challenges faced by different types of entities, and what kind of solutions and technology they are using to provide their services. Then we present a first-of-its-kind measurement study that sheds light on the prices of data in the market using a novel methodology. We study how ten commercial data marketplaces categorise and classify data assets, and which categories of data command higher prices.
We also develop classifiers for comparing data products across different marketplaces, and we study the characteristics of the most valuable data assets and the features that specific vendors use to set the price of their data products. Based on this information and adding data products offered by 33 other data providers, we develop a regression analysis for revealing features that correlate with prices of data products. As a result, we also implement the basic building blocks of a novel data pricing tool capable of providing a hint of the market price of a new data product using just its metadata as input. This tool would provide more transparency on the prices of data products in the market, which will help in pricing data assets and in avoiding the inherent price fluctuation of nascent markets. Next we turn to topics related to data marketplace design. In particular, we study how buyers can select and purchase suitable data for their tasks without requiring a priori access to such data in order to make a purchase decision, and how marketplaces can distribute payoffs for a data transaction combining data of different sources among the corresponding providers, be they individuals or firms. The difficulty of both problems is further exacerbated in a human-centric data economy where buyers have to choose among data of thousands of individuals, and where marketplaces have to distribute payoffs to thousands of people contributing personal data to a specific transaction. Regarding the selection process, we compare different purchase strategies depending on the level of information available to data buyers at the time of making decisions. A first methodological contribution of our work is proposing a data evaluation stage prior to datasets being selected and purchased by buyers in a marketplace.
We show that buyers can significantly improve the performance of the purchasing process just by being provided with a measurement of the performance of their models when trained by the marketplace with individual eligible datasets. We design purchase strategies that exploit such functionality and we call the resulting algorithm Try Before You Buy. Our work demonstrates over synthetic and real datasets that it can lead to near-optimal data purchasing with only O(N) execution time instead of the exponential O(2^N) needed to calculate the optimal purchase. With regard to the payoff distribution problem, we focus on computing the relative value of spatio-temporal datasets combined in marketplaces for predicting transportation demand and travel time in metropolitan areas. Using large datasets of taxi rides from Chicago, Porto and New York, we show that the value of data is different for each individual, and cannot be approximated by its volume. Our results reveal that even more complex approaches based on the “leave-one-out” value are inaccurate. Instead, more complex and established notions of value from economics and game theory, such as the Shapley value, need to be employed if one wishes to capture the complex effects of mixing different datasets on the accuracy of forecasting algorithms. However, the Shapley value entails serious computational challenges. Its exact calculation requires repetitively training and evaluating every combination of data sources and hence O(N!) or O(2^N) computational time, which is infeasible for complex models or thousands of individuals. Moreover, our work paves the way to new methods of measuring the value of spatio-temporal data.
We identify heuristics such as entropy or similarity to the average that show a significant correlation with the Shapley value and therefore can be used to overcome the significant computational challenges posed by Shapley approximation algorithms in this specific context. We conclude with a number of open issues and propose further research directions that leverage the contributions and findings of this dissertation. These include monitoring data transactions to better measure data markets, and complementing market data with actual transaction prices to build a more accurate data pricing tool. A human-centric data economy would also require that the contributions of thousands of individuals to machine learning tasks are calculated daily. For that to be feasible, we need to further optimise the efficiency of data purchasing and payoff calculation processes in data marketplaces. In that direction, we also point to some alternatives to repetitively training and evaluating a model to select data based on Try Before You Buy and approximate the Shapley value. Finally, we discuss the challenges and potential technologies that help with building a federation of standardised data marketplaces. The data economy will develop fast in the upcoming years, and researchers from different disciplines will work together to unlock the value of data and make the most out of it. Maybe the proposal of getting paid for our data and our contribution to the data economy will finally take off, or maybe other proposals, such as the robot tax, will ultimately be used to balance the power between individuals and tech firms in the digital economy. Still, we hope our work sheds light on the value of data, and contributes to making the price of data more transparent and, eventually, to moving towards a human-centric data economy. YR 2023 FD 2023-05-10 LK https://hdl.handle.net/10016/37453 UL https://hdl.handle.net/10016/37453 LA eng NO This work has been supported by IMDEA Networks Institute DS e-Archivo RD 27 jul. 2024