<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>Sieve – Data Fusion Policy Learner</title>
<link rel="StyleSheet" href="stylesheets/style.css" type="text/css" media="screen"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div id="wbsg_navbar">
<div id="wbsg_navbar_projects">
<a href="http://dbpedia.org" title="DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web" >DBpedia</a>
<a href="http://spotlight.dbpedia.org/" title="DBpedia Spotlight is a tool for annotating DBpedia entities in text.">DBpedia Spotlight</a>
<a href="http://d2rq.org/d2r-server" title="D2R Server is a tool for publishing the content of relational databases on the Semantic Web" >D2R Server</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/r2r/" title="R2R Framework – Translating RDF data from the Web to a target vocabulary" >R2R</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/" title="The Silk framework is a tool for discovering relationships between data items within different Linked Data sources" >Silk</a>
<a href="http://sieve.wbsg.de/" title="Sieve is a tool for assessing data quality and performing data fusion." class="wbsg_navbar_active_project">Sieve</a>
<a href="http://ldif.wbsg.de/" title="LDIF – Linked Data Integration Framework translates heterogeneous Linked Data from the Web into a clean, local target representation while keeping track of data provenance" >LDIF</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/ng4j/" title="The Named Graphs API for Jena (NG4J) is an extension to the Jena Semantic Web framework for parsing, manipulating and serializing sets of Named Graphs" >NG4J</a>
<a href="http://mes.github.com/marbles/" title="Marbles is a server-side application that formats Semantic Web content for XHTML clients using Fresnel lenses and formats" >Marbles</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/wiqa/" title="The WIQA - Information Quality Assessment Framework is a set of software components that empowers information consumers to employ a wide range of different information quality assessment policies to filter information from the Web" >WIQA</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/pubby/" title="Pubby – A Linked Data Frontend for SPARQL Endpoints can be used to add Linked Data interfaces to SPARQL endpoints" >Pubby</a>
<a href="http://wifo5-03.informatik.uni-mannheim.de/bizer/rdfapi/" title="RAP – RDF API for PHP is a software package for parsing, querying, manipulating, serializing and serving RDF models" >RAP</a>
</div>
<!--div id="wbsg_navbar_intro">Open Source projects by the <a href="http://wbsg.de">Web-based Systems Group</a>: </div-->
<div id="wbsg_navbar_intro">Open Source projects by the <a href="http://dws.informatik.uni-mannheim.de/">Data and Web Science Group</a>: </div>
</div>
<!-- End WBSG navbar -->
<!--div id="logo" align="right">
<a href="http://wbsg.de"><img src="http://ldif.wbsg.de/images/fu-logo.gif" alt="Freie Universität Berlin Logo" border="0"></a>
</div-->
<DIV id=logo><A href="http://dws.informatik.uni-mannheim.de/"><IMG src="images/logo_uni_en.gif" alt="Universität Mannheim Logo"></A> </DIV>
<div id="header">
<h1 style="font-size: 200%;">Sieve – Data Fusion Policy Learner (FPL)</h1>
</div>
<div id="tagline">A Sieve module for automatically learning data fusion policies</div>
<div id="authors">
<a href="http://dws.informatik.uni-mannheim.de/en/people/researchers/dr-volha-bryl/">Volha Bryl</a><br>
<a href="http://dws.informatik.uni-mannheim.de/en/people/professors/prof-dr-christian-bizer/">Christian Bizer</a><br>
</div>
<div id="body-container">
<h2 id="news">News</h2>
<div>
<ul>
<li><b>05/02/2014</b>: <a href="#ref:WebQuality2014">Paper</a> about the FPL and the DBpedia data fusion use case will be presented at the <a href="http://www.dl.kuis.kyoto-u.ac.jp/webquality2014/">WebQuality workshop</a> of
<a href="http://www2014.kr/">WWW'2014</a>. [<a href="http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Bryl_Bizer_webquality14.pdf">pdf</a>]</li>
<li><b>19/12/2013</b>: First version released with <a href="http://ldif.wbsg.de/">LDIF 0.5.2</a>; see the <a href="https://github.com/wbsg/ldif/tree/master/ldif/ldif-modules/ldif-sieve/ldif-sieve-fpl">LDIF repository</a>.</li>
</ul>
</div>
<h2 id="contents">Contents</h2>
<div>
<ol class="toc">
<li><a href="#about">About</a></li>
<li><a href="#input">Input Specification</a></li>
<li><a href="#output">Output</a><br></li>
<li><a href="#start">Quick Start and Examples</a><br></li>
<li><a href="#development">Source Code and Development</a></li>
<li><a href="#feedback">Support and Feedback</a></li>
<li><a href="#references">References</a></li>
<li><a href="#acknowledgments">Acknowledgments</a></li>
</ol>
</div>
<div>
<h2 id="about">1. About</h2>
<p>
The <a href="http://sieve.wbsg.de/">Sieve</a> data quality assessment and data fusion framework provides a library of quality assessment scores and fusion functions,
and an XML-based input specification language to combine them. However, manually defining a fusion policy (that is, which fusion function to use for which data attribute, and with what parameters)
requires domain knowledge and an understanding of the input data, is time-consuming, and does not guarantee an optimal result.
</p>
<p>
Therefore, we introduce the <b>Fusion Policy Learner</b> (FPL), an extension of Sieve that automatically selects an optimal fusion function for each property based on a gold standard.
The user is still involved in the process, as the list of candidate fusion functions per property must be specified manually.
</p>
<p>
The learning algorithm implemented in the Fusion Policy Learner selects the fusion function that minimizes the error with respect to a <i>gold standard</i>.
The list of fusion functions and the respective quality assessment scores for each property are defined in the input specification, which is written
using the extended version of the XML-based Sieve specification language.
</p>
<p>
The learning algorithm first detects, based on the gold standard, whether the values to fuse are numeric or nominal; examples of nominal values are strings and URIs.
</p>
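<p>
The type detection step can be illustrated with a minimal sketch. It assumes that a property counts as numeric only if every gold standard value parses as a finite number; the actual rule used by the FPL is not documented here. The function names and the <i>NOMINAL</i> label are hypothetical.
</p>

```python
import math


def is_numeric(value):
    """Return True if the lexical value parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def detect_property_type(gold_values):
    """Assumed rule: NUMERIC only if all gold standard values are numbers."""
    if gold_values and all(is_numeric(v) for v in gold_values):
        return "NUMERIC"
    return "NOMINAL"  # hypothetical label for strings, URIs, and mixed values


print(detect_property_type(["5764", "15071", "15941"]))
print(detect_property_type(["http://dbpedia.org/resource/Buitenpost"]))
```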
<p id="learning">
Then, for numeric properties, one of two learning strategies is applied: <b>(1)</b> the fusion function that minimizes the <i>mean absolute error</i> with respect to the gold
standard is selected, or <b>(2)</b> given a <i>maximum error threshold</i> (e.g. 5%), the function that maximizes the number of values deviating from the gold standard
by no more than the threshold is selected. For nominal values, the fusion function that produces the maximum number of exact matches with the gold standard is selected.
</p>
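<p>
The three selection rules can be sketched as follows. This is an illustration only, not the FPL implementation: it assumes that errors are measured relative to the gold standard value (which the percentage threshold suggests) and that gold values are non-zero. All function names and the toy data are invented for the example.
</p>

```python
from statistics import mean


def relative_error(fused, gold):
    """Absolute deviation as a fraction of the gold value (assumes gold != 0)."""
    return abs(fused - gold) / abs(gold)


def select_min_abs_error(candidates, gold):
    """Strategy (1): the function with the lowest mean error."""
    def mean_error(name):
        return mean(relative_error(f, g) for f, g in zip(candidates[name], gold))
    return min(candidates, key=mean_error)


def select_max_correct_values(candidates, gold, error=0.05):
    """Strategy (2): the function with the most values within the threshold."""
    def correct(name):
        return sum(relative_error(f, g) <= error for f, g in zip(candidates[name], gold))
    return max(candidates, key=correct)


def select_max_exact_matches(candidates, gold):
    """Nominal values: the function with the most exact matches."""
    def matches(name):
        return sum(f == g for f, g in zip(candidates[name], gold))
    return max(candidates, key=matches)


# Invented toy data: fused values per candidate function, aligned with gold.
gold = [100.0, 200.0, 400.0]
numeric = {
    "KeepFirst": [104.0, 230.0, 404.0],  # two values within 5%, one far off
    "Average": [106.0, 212.0, 424.0],    # every value 6% off
}
print(select_min_abs_error(numeric, gold))
print(select_max_correct_values(numeric, gold))
```

<p>
With this toy data the two numeric strategies disagree: one function is closer on average, while the other has more values within the threshold. This is why the choice of selection method matters.
</p>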
<h2 id="input">2. Input Specification</h2>
<p>Below you see an example of input specification for the Fusion Policy Learner:</p>
<pre>
1 <SieveFPL>
2 <Parameters>
3 <!--SelectionMethod name="MinAbsError"/-->
4 <SelectionMethod name="MaxCorrectValues" error="0.05"/>
5 </Parameters>
6 <Input>
7 <GoldStandard>gold\cities1000-Netherlands.gold.nt</GoldStandard>
8 <dumpLocation>dumps-nl</dumpLocation>
9 <SieveExec>c:\ldif-0.5.2\bin\ldif.bat</SieveExec>
10 </Input>
11 <Output>
12 <SieveSpec>sieve-optimal\sieve_optimal.xml</SieveSpec>
13 <FPLReport>FPL_report.txt</FPLReport>
14 <!--FPLReport valmatrix = "true">FPL_report.txt</FPLReport-->
15 </Output>
16 <Sieve xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
17 <Prefixes>
18 <Prefix id="dbpedia-owl" namespace="http://dbpedia.org/ontology/"/>
19 <Prefix id="ldif" namespace="http://www4.wiwiss.fu-berlin.de/ldif/"/>
20 <Prefix id="sieve" namespace="http://sieve.wbsg.de/vocab/"/>
21 <Prefix id="dbpedia-meta" namespace="http://dbpedia.org/metadata/"/>
22 </Prefixes>
23 <QualityAssessment>
24 <AssessmentMetric id="sieve:authactivity">
25 <ScoringFunction class="NormalizedCount">
26 <Param name="maxCount" value="4250000"/>
27 <Input path="?GRAPH/dbpedia-meta:autheditcnt"/>
28 </ScoringFunction></AssessmentMetric>
29 <AssessmentMetric id="sieve:recency">
30 <ScoringFunction class="TimeCloseness">
31 <Param name="timeSpan" value="500"/>
32 <Input path="?GRAPH/dbpedia-meta:lastedit"/>
33 </ScoringFunction>
34 </AssessmentMetric>
35 </QualityAssessment>
36 <Fusion>
37 <Class name="dbpedia-owl:PopulatedPlace">
38 <Property name="dbpedia-owl:areaTotal">
39 <FusionFunction class="KeepFirst" metric="sieve:recency"/>
40 <FusionFunction class="KeepFirst" metric="sieve:authactivity"/>
41 <FusionFunction class="Voting"/>
42 <FusionFunction class="Average"/>
43 </Property>
44 <Property name="dbpedia-owl:populationTotal">
45 <FusionFunction class="KeepFirst" metric="sieve:recency"/>
46 <FusionFunction class="Average"/>
47 <FusionFunction class="Maximum"/>
48 </Property>
49 </Class>
50 </Fusion>
51 </Sieve>
52 </SieveFPL>
</pre>
<p>
The root tag is <SieveFPL>, under which <b>four elements</b> have to be specified:
parameters of the learning algorithm (lines 2-5),
input (lines 6-10) and output (lines 11-15) paths,
and the extended Sieve specification (lines 16-51).
</p>
<p>
The <b><SelectionMethod></b> element (lines 3-4) specifies the parameters of the learning algorithm for numeric data values.
The <i>MinAbsError</i> value of the <i>name</i> attribute corresponds to the first learning strategy (<a href="#learning">see above</a>),
and <i>MaxCorrectValues</i> to the second one, with the <i>error</i> attribute defining the maximum error threshold.
</p>
<p>In the <b><Input></b> section the following parameters are specified:</p>
<ul>
<li><GoldStandard>: path to the gold standard dump, which should be in N-Triples format <i>(all paths are relative to the specification file path)</i>;</li>
<li><dumpLocation>: path to the input dumps, which should be in N-Quads format (the same requirement as for the input data in Sieve);</li>
<li><SieveExec>: path to the Sieve shell script or batch file.</li>
</ul>
<p>The <b><Output></b> section contains the following two paths to the output files generated by the tool:</p>
<ul>
<li><SieveSpec>: path to the resulting Sieve specification file;</li>
<li><FPLReport>: path to the report summarizing the learning process.</li>
</ul>
The optional <i>valmatrix</i> attribute of <FPLReport> (uncomment line 14 and use it in place of line 13) adds the <i>value matrix</i> to the report:
for each property and each subject URI, the values produced by the fusion functions are listed, one per function. For the sample specification above, the value matrix for <i>dbpedia-owl:populationTotal</i>
(lines 45-47) is as follows. The values refer to <i>keep the most recent</i>, <i>average</i>, <i>maximum</i> and the gold standard value, respectively; the data source
is given in brackets (the language code in the case of Wikipedia namespaces).
<pre>
http://dbpedia.org/resource/Buitenpost 5834 (en) 5777 (average) 5834 (en) 5764 (gold)
http://dbpedia.org/resource/Noordwijkerhout 15541 (it) 15460 (average) 15541 (ru) 15071 (gold)
http://dbpedia.org/resource/Harenkarspel 15922 (ru) 15973 (average) 16076 (it) 15941 (gold)
</pre>
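<p>
As an illustration of how the value matrix can be used, the sketch below parses the three sample rows shown above and computes a mean error per fusion function column. It assumes whitespace-separated rows of the form <i>URI value (source) ... value (gold)</i>, as in the sample, and measures the error relative to the gold value; the exact file layout and the function names are assumptions.
</p>

```python
import re

# The three sample rows from the value matrix above.
VALUE_MATRIX = """\
http://dbpedia.org/resource/Buitenpost 5834 (en) 5777 (average) 5834 (en) 5764 (gold)
http://dbpedia.org/resource/Noordwijkerhout 15541 (it) 15460 (average) 15541 (ru) 15071 (gold)
http://dbpedia.org/resource/Harenkarspel 15922 (ru) 15973 (average) 16076 (it) 15941 (gold)
"""

# A value followed by its source label in parentheses, e.g. "5834 (en)".
PAIR = re.compile(r"(\S+)\s+\(([^)]+)\)")


def parse_row(line):
    """Split a row into (subject URI, fused values, gold value)."""
    uri = line.split()[0]
    values = [float(value) for value, _label in PAIR.findall(line)]
    return uri, values[:-1], values[-1]  # the last pair is the gold value


def mean_relative_errors(matrix_text):
    """Mean relative error per fusion function column."""
    rows = [parse_row(line) for line in matrix_text.splitlines() if line.strip()]
    columns = len(rows[0][1])
    return [
        sum(abs(fused[i] - gold) / gold for _uri, fused, gold in rows) / len(rows)
        for i in range(columns)
    ]


errors = mean_relative_errors(VALUE_MATRIX)
for function_id, error in enumerate(errors):
    print(function_id, round(error, 4))
```

<p>
On these three rows alone the <i>average</i> column has the lowest mean error. The full gold standard covers far more subjects, so this small sample need not agree with the learned policy.
</p>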
<p>
The <b><Sieve></b> element contains the extended or "redundant" Sieve specification: for each property a list of fusion functions is defined,
and the FPL selects an optimal one from the list with respect to the gold standard.
</p>
<p>
In <a href="#input">lines 39-42</a>, four fusion functions (<i>keep the most recent value, keep the value added by the most active author, most frequent, average</i>) are specified for the <i>areaTotal</i>
property of a populated place. The FPL chooses only one of these functions and puts it into the final Sieve specification (<a href="#input">line 12</a>), in accordance with the selection method defined in
<Parameters>.
</p>
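<p>
The reduction from a "redundant" specification to a final one can be sketched as follows. This is a hypothetical illustration using Python's standard <i>xml.etree.ElementTree</i> module, not the code the FPL uses; the function name <i>keep_selected</i> and the <i>selected</i> mapping (property name to the index of the chosen function) are assumptions.
</p>

```python
import xml.etree.ElementTree as ET

NS = "http://www4.wiwiss.fu-berlin.de/ldif/"

# A fragment of the sample specification above.
SPEC = """<Sieve xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
  <Fusion>
    <Class name="dbpedia-owl:PopulatedPlace">
      <Property name="dbpedia-owl:populationTotal">
        <FusionFunction class="KeepFirst" metric="sieve:recency"/>
        <FusionFunction class="Average"/>
        <FusionFunction class="Maximum"/>
      </Property>
    </Class>
  </Fusion>
</Sieve>"""


def keep_selected(spec_xml, selected):
    """Drop every FusionFunction except the selected one per property."""
    ET.register_namespace("", NS)  # serialize with a default namespace
    root = ET.fromstring(spec_xml)
    for prop in root.iter(f"{{{NS}}}Property"):
        keep = selected[prop.get("name")]
        functions = prop.findall(f"{{{NS}}}FusionFunction")
        for index, function in enumerate(functions):
            if index != keep:
                prop.remove(function)
    return ET.tostring(root, encoding="unicode")


# Keep the function with index 2 (Maximum) for populationTotal.
print(keep_selected(SPEC, {"dbpedia-owl:populationTotal": 2}))
```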
<p>
In the current FPL version, learning can be performed for only one class at a time, which means that learning an optimal fusion policy for the properties
of e.g. <i>dbpedia-owl:CelestialBody</i> would require another specification file.
</p>
<h2 id="output">3. Output</h2>
The output of the tool consists of
<ul>
<li>Sieve specification file (<a href="#input">line 12</a> defines the file name and location, relative to the FPL specification file) that can be used directly for data fusion with Sieve, </li>
<li>FPL report file (<a href="#input">line 13</a> defines the file name and location), in which for each property the errors per fusion function are listed.</li>
</ul>
Below you see an extract of the report corresponding to the <a href="#input">FPL specification presented above</a>.
<pre>
*** Learning an optimal fusion function for dbpedia-owl:populationTotal property ***
Number of gold standard values = 493
According to the gold standard, dbpedia-owl:populationTotal is NUMERIC
Pool of fusion functions:
0 : <FusionFunction class="KeepFirst" metric="sieve:recency"/>
1 : <FusionFunction class="Average"/>
2 : <FusionFunction class="Maximum"/>
Errors per fusion function (functions identified by int ID):
0, mean absolute error : 0.02368750565457104, count : 493.0
0, number of 5.0% correct values : 318
1, mean absolute error : 0.022413082496238555, count : 493.0
1, number of 5.0% correct values : 301
2, mean absolute error : 0.021932866337685517, count : 493.0
2, number of 5.0% correct values : 329
MinAbsError: best fusion function ID, error %, count: 2, 2.193286633768552, 493.0
MaxCorrectValues: best fusion function ID, number of correct values : 2, 329
</pre>
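<p>
The per-function lines of the report can be read back with a short script, for example to compare functions across several runs. The sketch below assumes the line layout shown in the extract above; the regular expressions and the function name <i>parse_errors</i> are illustrative and not part of the FPL.
</p>

```python
import re

# Error lines copied from the report extract above.
REPORT = """\
0, mean absolute error : 0.02368750565457104, count : 493.0
0, number of 5.0% correct values : 318
1, mean absolute error : 0.022413082496238555, count : 493.0
1, number of 5.0% correct values : 301
2, mean absolute error : 0.021932866337685517, count : 493.0
2, number of 5.0% correct values : 329
"""

MAE = re.compile(r"^(\d+), mean absolute error : ([0-9.eE+-]+), count")
CORRECT = re.compile(r"^(\d+), number of [0-9.]+% correct values : (\d+)")


def parse_errors(report_text):
    """Return (mean error per function ID, correct-value count per function ID)."""
    mean_errors, correct_counts = {}, {}
    for line in report_text.splitlines():
        line = line.strip()
        match = MAE.match(line)
        if match:
            mean_errors[int(match.group(1))] = float(match.group(2))
            continue
        match = CORRECT.match(line)
        if match:
            correct_counts[int(match.group(1))] = int(match.group(2))
    return mean_errors, correct_counts


mean_errors, correct_counts = parse_errors(REPORT)
best_by_error = min(mean_errors, key=mean_errors.get)
best_by_count = max(correct_counts, key=correct_counts.get)
print(best_by_error, best_by_count)
```

<p>
For this extract both criteria point to function ID 2, which agrees with the two "best" lines at the end of the report.
</p>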
<p>
The report contains the number of gold standard values for the <i>dbpedia-owl:populationTotal</i> property (<i>493</i>), the detected property type (<i>numeric</i>),
and the list of fusion functions as defined for <i>populationTotal</i> in the FPL specification. Fusion functions are assigned numeric IDs (<i>0, 1, 2</i>),
which are then used to report the errors of each function with respect to the gold standard.
</p>
<p>
For numeric properties, errors are reported for both learning methods, <i>MinAbsError</i> and <i>MaxCorrectValues</i> (with the default 5% threshold), and the optimal
function for each method (referred to as "best" in the report) is listed. In our example, both selection methods select the <i>maximum</i> fusion function as the optimal one.
The final Sieve specification will include only the function that is optimal according to the <i>MaxCorrectValues</i> method, as defined in <a href="#input">line 4 of our sample FPL specification</a>.
</p>
<h2 id="start">4. Quick Start and Examples</h2>
<p>
To demonstrate the Fusion Policy Learner, the <i>multilingual DBpedia</i> example is distributed with the LDIF binaries
(<i>dbpedia-multilang</i> directory) and can be found in the <a href="https://github.com/wbsg/ldif/tree/master/ldif/examples/dbpedia-multilang/">LDIF repository</a>.
The example aims at fusing data for the same city from multiple language editions of DBpedia. In the example directory you will find the input specification (<i>SieveFPL.xml</i>)
along with data and provenance metadata dumps (<i>dumps-3cities</i> and <i>dumps-nl</i> directories; use one of the two when specifying <i>dumpLocation</i> in
<a href="#input">line 8</a>) and the gold standard (in the <i>gold</i> directory).
</p>
<p>
To run the FPL with the multilingual DBpedia example, download and unpack the LDIF binaries, and run
</p>
<i>java -jar lib\ldif-sieve-fpl-0.1.1-jar-with-dependencies.jar examples\dbpedia-multilang\SieveFPL.xml</i>
<p>
from the directory into which you unpacked the binaries.
</p>
<h2 id="development">5. Source Code and Development</h2>
<p>The latest source code is available from the <a href="http://github.com/wbsg/ldif/">LDIF development page</a> on GitHub.</p>
<p>The framework can be used under the terms of the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache Software License</a>.</p>
<h2 id="feedback">6. Support and Feedback </h2>
<p>For questions and feedback please use the <a href="http://groups.google.com/group/ldif?hl=en">LDIF Google Group</a>.</p>
<h2 id="references">7. References</h2>
<div>
<ul>
<li id = "ref:WebQuality2014">Volha Bryl, Christian Bizer. <b>Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion</b>.
4th Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality) @ WWW 2014.
[<a href="http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Bryl_Bizer_webquality14.pdf">pdf</a>]</li>
</ul>
</div>
<h2 id="acknowledgments">8. Acknowledgments </h2>
<p>This work was supported by the EU FP7 grant <a href="http://lod2.eu/">LOD2 - Creating Knowledge out of Interlinked Data</a> (Grant No. 257943).</p>
<!--p>WooFunction icon set licensed under <a href="http://www.gnu.org/licenses/gpl.html">GNU General Public License</a>.</p-->
</div>
</div>
</body>
</html>