Semantic Web Ontologies: What Works and What Doesn't
by Peter Norvig 01.12.05
There were four challenges. The first is a chicken-and-egg problem: how do we build
this information? There is no point in building the tools unless you have the
information, and no point in putting the information in unless you have the tools. A
friend of mine just asked whether I could send him all the URLs on the web ending in
dot-RDF, dot-OWL, and a couple of other extensions, because he could not find them
all. I looked, and there were only about 200,000 of them, which is roughly 0.005% of
the web. We have a long way to go.
The next problem is competition among ontologies. Everybody looks at things in a
different way. There are some tools to address this, and we will see how far they scale.
Then there is the Cyc problem, which is the problem of background knowledge, and the
spam problem. That one I have to face every day. As you leave the lab and go into the
real world, there are people with a financial interest in trying to defeat you.
So, the chicken-and-egg problem. That is, "What interesting information is in these
kinds of semantic technologies, and where is the rest of the information?" It turns
out that most of the useful information is still in text. Our concern is how to extract
it from the text. Here is a small demo called IO node. You can type a natural language
question, and it pulls documents out of the text and pulls out the semantic entities.
As you will see, it is not quite perfect; for example, it does not always resolve
spelling properly. But everything is done automatically, so nobody has to do the work
of putting the information into the right place.
In short, semantic technologies look good for defining a schema, but what should go
into the schema is another matter. Getting it there also needs to be ...
Here is another example. This is the Google News page from last night. Here we apply
clustering technology to group the news stories into categories, so you can see that
the top story is about Blair, and there are 658 related stories that we have clustered
together with it.
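The grouping idea can be illustrated with a toy sketch. This is not Google News's
algorithm, which the talk does not describe; it is a minimal greedy clustering by word
overlap (Jaccard similarity). The headlines, the threshold, and all function names are
made up for illustration.

```python
import re

def tokens(headline):
    """Lowercase word set for a headline, ignoring words of 3 letters or fewer."""
    return {w for w in re.findall(r"[a-z]+", headline.lower()) if len(w) > 3}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(headlines, threshold=0.25):
    """Greedy clustering: join the first cluster whose first headline is similar enough."""
    clusters = []
    for headline in headlines:
        words = tokens(headline)
        for group in clusters:
            if jaccard(words, tokens(group[0])) >= threshold:
                group.append(headline)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([headline])
    return clusters

# Invented headlines, used only to exercise the sketch.
HEADLINES = [
    "Blair defends election pledge on taxes",
    "Election pledge on taxes defended by Blair",
    "Storm closes airports across Europe",
    "Airports across Europe closed by winter storm",
]

if __name__ == "__main__":
    for group in cluster(HEADLINES):
        print(len(group), "stories:", group[0])
```

Nothing here depends on labels supplied by the publisher; the grouping comes only from
the text of the stories themselves.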
Now imagine that instead of using our algorithms we relied on the news providers to
supply all the metadata and labels the way they wanted to. "Will my story be buried on
page 20, or is it a headline? I will put in my metadata. Are the people I am talking
about terrorists or freedom fighters? What is the definition of patriot? What is the
definition of marriage?"
When you are talking about political questions like these rather than part numbers,
just defining the ontology becomes a political statement. People get killed over less
than this. These are places where ontologies are not going to work; there will be
arguments over them, and you have to fall back on other methods.
Ontologies work best when you have a small group of consumers with the power to force
the providers to play along. Take the auto parts industry, where the car manufacturers
can get together and say, "Everybody who wants to sell to us has to do this." They can
do that because there are only a few of them. In other industries, if there is one
major player, it does not want to take part because it does not want the others to
catch up. And if there are many small players, it is hard to get them organized
together.
Semantic technologies are good at breaking information up into chunks. But in the end
you get down to the part between the angle brackets (<>). As one of our founders,
Sergey Brin, has said, "Putting angle brackets around something is not a technology by
itself."
The question is what goes inside the angle brackets. You can say, "Well, my database
has a name field, and your database has a first name field and a last name field, so we
will concatenate them and match them up." But this does not always work so well.
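A minimal sketch of that field-matching idea, and of where it breaks, might look like
the following. The function names and the sample values are hypothetical; the point is
only that joining two fields and comparing strings works for the easy case and fails as
soon as the two databases write the name differently.

```python
def full_name(first, last):
    """Join a first-name field and a last-name field into one string."""
    return f"{first} {last}".strip()

def naive_match(person_name, first, last):
    """Case-insensitive exact comparison of a single name field
    against the concatenation of the other database's two fields."""
    return person_name.strip().lower() == full_name(first, last).lower()

if __name__ == "__main__":
    # The easy case works.
    print(naive_match("Sergey Brin", "Sergey", "Brin"))   # True
    # The same person, written in other common ways, does not match.
    print(naive_match("Brin, Sergey", "Sergey", "Brin"))  # False
    print(naive_match("S. Brin", "Sergey", "Brin"))       # False
```

The schema mapping is the easy part; the values inside the fields still need to be
interpreted.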
Here is an example worth considering: a few days' worth of queries at Google that our
spelling correction maps to a single canonical form. It is one of our most frequently
asked queries, and in one week there were about 4,000 different spelling variations of
it. Somebody has to do this canonicalization. So the problem of understanding text has
not gone away; it has just been pushed inside the angle brackets and broken into
smaller pieces. There is the problem of spelling correction; the problem of
transliteration, such as from Arabic into the Roman alphabet; the problem of
abbreviations, such as HP versus Hewlett Packard; and the problem of identical names:
is Michael Jordan the basketball player, the CEO, or the Berkeley professor?
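To make the canonicalization step concrete, here is a small sketch. It is not how
Google's spelling correction works; it assumes a tiny hand-written list of canonical
forms and an alias table, and uses the standard library's difflib for fuzzy matching.
The table contents, the cutoff, and the function names are illustrative assumptions.

```python
import difflib

# Hypothetical canonical forms and abbreviations, for illustration only.
CANONICAL = ["hewlett packard", "michael jordan"]
ALIASES = {"hp": "hewlett packard"}

def normalize(text):
    """Lowercase, treat hyphens as spaces, and collapse repeated whitespace."""
    return " ".join(text.lower().replace("-", " ").split())

def canonicalize(query, cutoff=0.8):
    """Map a query to a canonical form, or return None if nothing is close enough."""
    q = normalize(query)
    if q in ALIASES:
        return ALIASES[q]
    matches = difflib.get_close_matches(q, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    for query in ["HP", "Hewlett-Packard", "Hewlet Packard", "Michael Jordan"]:
        print(query, "->", canonicalize(query))
```

Note what the sketch cannot do: every "Michael Jordan" collapses to the same string, so
the identical-names problem is untouched. Telling the basketball player from the
professor needs context, not string matching.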
Now let us consider background knowledge. The Cyc project set out to define all the
knowledge in a dictionary, a Dublin Core type of thing, and then found that what we
still needed was the material that is not in a dictionary or encyclopedia. Lenat and
Guha said that there is a vast storehouse of knowledge that you rarely talk about, such
as "water flows down" and "living things get ..."
I thought we could try to launch a big project to do this. Then I decided to simplify
it a bit: just put quotation marks around the phrase and type it in. So I typed "water
flows down" and got 1,200 results. The first result said, "This is a lesson plan by
Emily, a kindergarten teacher." It actually explains why water flows down, which is
something you will not find in an encyclopedia.
The conclusion here is that Lenat was 99.999993% right, because out of 4.3 billion
cases only 1,200 actually discussed water flowing down. But that is enough, and you can
go on from there. You can use the web to vote: you can say that pumps can make water
flow up, but that occurred only 275 times, so flowing down wins, 1,200 to 275.
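The voting step is simple enough to write down. The sketch below takes hit counts as
given (it does not query any search engine) and picks the phrase with the most hits;
the second phrase is a paraphrase of the pump example, not the exact query used in the
talk. It also computes the share of pages that do not mention a phrase; a direct
calculation from 1,200 hits in 4.3 billion pages gives about 99.99997%, the same
order of magnitude as the figure quoted in the talk, though not identical to it.

```python
def winner(hit_counts):
    """Return the phrase with the highest hit count."""
    return max(hit_counts, key=hit_counts.get)

def percent_not_mentioning(hits, total_pages):
    """Percentage of pages that do not contain the phrase at all."""
    return 100.0 * (1 - hits / total_pages)

# Counts from the talk; the second key is a paraphrase, not the literal query.
COUNTS = {"water flows down": 1200, "pumps make water flow up": 275}
TOTAL_PAGES = 4_300_000_000

if __name__ == "__main__":
    print("winner:", winner(COUNTS))
    print(f"{percent_not_mentioning(1200, TOTAL_PAGES):.6f}% of pages never say it")
```

The design choice is the one the talk argues for: no hand-built knowledge base, just
counting what people have already written.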
What we are really doing here is using a large mass of untrained people, whom you do
not pay, to do all the work for you, as opposed to getting trained people to use a
well-defined formalism and write text in that formalism. Let us just use the material
that is already out there. I am all for harvesting this "unskilled labor" and putting
it to use by applying statistical techniques to large masses of data and filtering
through it yourself, rather than trying to define everything precisely on your own.
The last problem is spam. When you are in the lab defining your ontology, everything
looks nice and clean. But once you release it into the online world, you find out how
devious some people are. Here is an example: it looks like two pages, but it is
actually one. On the left is the page as Googlebot (Google's web crawler) sees it; on
the right is the page as any other user agent sees it. When this site sees Googlebot,
it serves the page it thinks will best get us to match it, and when an ordinary user
browses it, it shows the page it wants to show.
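One way to picture the check is to request the same page under two user-agent strings
and compare what comes back. The sketch below does exactly that against simulated sites
(plain functions, no network access). It is an illustration of the idea only, not a
description of how Google detects cloaking; the threshold, the user-agent strings, and
the sample page text are all assumptions.

```python
import difflib
from typing import Callable

def looks_cloaked(fetch: Callable[[str], str], threshold: float = 0.5,
                  crawler_agent: str = "Googlebot",
                  browser_agent: str = "Mozilla/5.0") -> bool:
    """Fetch the page as a crawler and as a browser; flag it if the
    two responses are less similar than the threshold."""
    crawler_page = fetch(crawler_agent)
    browser_page = fetch(browser_agent)
    similarity = difflib.SequenceMatcher(None, crawler_page, browser_page).ratio()
    return similarity < threshold

def cloaking_site(user_agent: str) -> str:
    """Simulated site that serves keyword-stuffed text only to the crawler."""
    if "Googlebot" in user_agent:
        return "cheap flights hotels cars insurance loans " * 5
    return "Buy our pills now!"

def honest_site(user_agent: str) -> str:
    """Simulated site that serves everyone the same page."""
    return "Welcome to our bakery. Fresh bread daily."

if __name__ == "__main__":
    print("cloaking site flagged:", looks_cloaked(cloaking_site))
    print("honest site flagged:", looks_cloaked(honest_site))
```

A real site could vary its response for legitimate reasons (ads, personalization,
timestamps), so a raw similarity score like this would produce false alarms in
practice.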
What this shows is, first, that we have a lot of work to do to deal with this kind of
thing, and also that you cannot trust metadata. You cannot trust what people say. In
general, search engines have moved away from metadata and put more effort into what the
user actually experiences. For the most part we throw away the meta tags unless there
is a real reason to believe them, because they are more likely to deceive than to help.
And the more there is a marketplace in which people can make money by deception, the
more likely it is to happen. People are very good at detecting this kind of spam, but
machines are not necessarily as good. So as more information flows between machines,
this is something you will see more and more.
