
                              Seminar Report
                                    On



      PLAGIARISM DETECTION TECHNIQUES


                               Submitted by

                               JAYA P A




  In partial fulfillment of the requirements for the degree of Master of Technology
                                 (M Tech)
                                    In
                    Computer and Information Science




           DEPARTMENT OF COMPUTER SCIENCE
 COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
                             KOCHI – 682 022
                                   2007
                          ACKNOWLEDGEMENT


I would like to express my sincere thanks to Lord Almighty without whose blessings I
would not have completed my seminar. I would like to thank all those who have
contributed to the completion of the seminar and helped me with valuable suggestions for
improvement.




I am extremely grateful to Prof. Dr. K Poulose Jacob, Director, Dept. of Computer
Science, for providing me with the best facilities and atmosphere for creative work,
and for his guidance and encouragement.




I would like to thank my coordinator, G. Santhosh Kumar, Lecturer, Dept. of Computer
Science, CUSAT, for all the help and support extended to me. I thank all the staff
members of my college and my friends for extending their cooperation during my seminar.




Above all, I would like to thank my parents, without whose blessings I would not have
been able to accomplish my goal.
                                    ABSTRACT

This paper gives an overview of plagiarism and the techniques used to detect it.
Plagiarism in the sense of "theft of intellectual property" has been around for as long
as humans have produced works of art and research. However, easy access to the Web,
large databases, and telecommunication in general has turned plagiarism into a serious
problem for publishers, researchers and educational institutions. More and more people
are beginning to realize that plagiarism is a moral phenomenon that cannot exist in a
society with high ethical standards. Nowadays many methods to fight plagiarism have
been developed and are in use. In this paper, I concentrate on plagiarism detection
methods and the features of these methods. After that, analyses of already developed
tools are presented.

Key words: Plagiarism, plagiarism prevention, plagiarism detection, similarity measures
                     CONTENTS


1. Introduction…………………………………………………………...1
2. Describing plagiarism………………………………………………...1
3. Methods to reduce plagiarism………………………………………...3
4. Prevention methods…………………………………………………...5
5. Detection Methods……………………………………………………5
  5.1 Document Source Comparison…………………………………...5
  5.2 Manual Search Of Characteristic Phrases……………………….10
  5.3 Quiz method……………………………………………………..11
6. Available tools………………………………………………………12
  6.1 Attributes of detection tools……………………………………..12
  6.2 Turnitin…………………………………………………………..12
  6.3 Glatt……………………………………………………………...14
  6.4 Jplag……………………………………………………………..16
  6.5 WCopyfind………………………………………………………17
7. Limitations of detection tools………………………………………...20
8. Conclusion……………………………………………………………20
9. References……………………………………………………………21
Plagiarism Detection Techniques



1. INTRODUCTION


Plagiarism is a significant problem on almost every college and university campus. The
problems of plagiarism go beyond the campus, and have become an issue in industry,
journalism, and government activities. Although plagiarism has been a problem for
centuries, the Internet and “Copy/Paste” operation makes plagiarism very easy and
attractive for students in the twenty-first century.

In order to detect plagiarism, comparisons must be made between a target document (the
suspect) and a set of reference documents. A second method expands this document check:
the set of reference documents becomes 'everything' that is reachable on the Internet,
and the candidate to be checked is a characteristic paragraph or sentence rather than
the entire document. The emergence of tools such as Google has made this type of check
feasible.


The remainder of this paper is organized as follows. The next section gives some ideas
about plagiarism and its methods, and the section after that discusses plagiarism
reduction. Then different methods for plagiarism detection are described. After that,
analyses of already developed tools are presented. Finally, some conclusions are given.


2. DESCRIBING PLAGIARISM
Plagiarism can be described as:
        turning in someone else's work as your own
        copying words or ideas from someone else without giving credit
        failing to put a quotation in quotation marks
        giving incorrect information about the source of a quotation
        changing words but copying the sentence structure of a source without giving
        credit
        copying so many words or ideas from a source that it makes up the majority of
        your work, whether you give credit or not



Department of Computer Science, CUSAT                                                   1




Plagiarism is derived from the Latin word "plagiarius", which means kidnapper. It is
defined as "the passing off of another person's work as if it were one's own, by claiming
credit for something that was actually done by someone else". Plagiarism is not always
intentional stealing from someone else; it can be unintentional or accidental, and may
comprise self-stealing. The broader categories of plagiarism include:
• Accidental: due to lack of knowledge of plagiarism and of the citation or referencing
style practiced at an institute.
• Unintentional: the vastness of available information influences thoughts, and the same
ideas may come out via spoken or written expression as one's own.
• Intentional: a deliberate act of copying all or part of someone else's work without
giving proper credit to the original creator.
• Self-plagiarism: using one's own previously published work in some other form without
referring to the original.


Commonly, in practice there are different plagiarism methods. Some of them include:
        Copy-paste plagiarism (copying textual information word for word);
        Paraphrasing (restating the same content in different words);
        Translated plagiarism (translating content and using it without reference to the
        original work);
        Artistic plagiarism (presenting the same work using different media: text, images
        etc.);
        Idea plagiarism (using similar ideas which are not common knowledge);
        Code plagiarism (using program code without permission or reference);
        Improper use of quotation marks (failing to identify the exact parts of borrowed
        content);
        Misinformation of references (adding a reference to an incorrect or non-existent
        source).







3. METHODS TO REDUCE PLAGIARISM


Nowadays many methods to fight against plagiarism are developed and used. These
methods can be divided into two classes:


(1) Methods for plagiarism prevention, and
(2) Methods for plagiarism detection.


If we consider plagiarism a kind of social illness, then the methods of the first class
are precautionary measures that aim to preclude the rise of the illness, while the
methods of the second class are cures aimed at averting the existing illness. Some
examples of methods in each class are as follows: plagiarism prevention – honesty
policies and/or punishment systems; plagiarism detection – software tools that reveal
plagiarism automatically.


Each method has a set of attributes that determine its application. Two main attributes
which are common to all methods are:


1) Work-intensity of the method's implementation;
2) Duration of the method's efficiency.


Work-intensity of implementation means the amount of resources (mainly time) needed to
develop the method and bring it into use. Plagiarism prevention methods are usually
time-consuming in their realization, while plagiarism detection methods require less
time.


Duration of the method's efficiency means the period of time during which the positive
effect of the method's realization lasts. Implementation of prevention methods gives a
long-term positive effect; in contrast, implementation of detection methods gives a
short-term positive effect. The durations differ because the two classes take opposite
approaches to fighting plagiarism: detection methods are based on intimidation, while
prevention methods rely more on changing society's attitude toward plagiarism.



                                          Attributes of method

           Method                  Implementation work-intensity    Duration of positive effect

Plagiarism prevention methods      Require more time to implement   Positive effect is not momentary,
                                                                    but long-term
Plagiarism detection methods       Require less time to implement   Positive effect is momentary,
                                                                    but short-term


           Table 3.1: Attributes of plagiarism detection and prevention methods


Despite the differences between prevention and detection methods, all of them serve one
common goal – the fight against plagiarism. To make this fight efficient, a systematic
approach to the plagiarism problem is needed, i.e. prevention and detection methods must
be combined. To achieve momentary, short-term positive results, detection methods must
be applied at the problem's initial stages, but to achieve positive results over the
long term, prevention methods must be put into action. Detection methods can only
minimize plagiarism, whereas prevention methods can fully eliminate the plagiarism
phenomenon, or at least decrease it to a great extent. That is why prevention methods
are without doubt the more significant measures in the fight against plagiarism.
Unfortunately, plagiarism prevention is a problem for society as a whole, i.e. it is at
least a nation-wide problem which cannot be solved by the efforts of one university or
its department.







4. PREVENTION METHODS


Honesty Policies: Although plagiarism is reasonably well defined and explained in many
forums, the penalty for detected cases varies from case to case and from institution to
institution. Many universities have well-defined policies to classify and deal with
academic misconduct. Rules and information regarding them are made available to students
during the enrolment process, via information brochures and the university web sites.
        Academic dishonesty can be dealt with at the teacher-student level or the
institute-student level. The penalties that teachers can impose include written or
verbal warnings, failing or lowered grades, and extra assignments. Institutional case
handling involves a hearing and an investigation by an appropriate committee, with the
accused aware of and part of the whole process. Institutional-level punishments may
include official censure, academic integrity training exercises, social work, transcript
notation, suspension, expulsion, revocation of a degree or certificate, and possibly
even referral of the case to legal authorities.



5. PLAGIARISM DETECTION METHODS

5.1 DOCUMENT SOURCE COMPARISON

Plagiarism detection is usually based on a comparison of two or more documents. A
collection of submitted work is known as a corpus. Where the source and copy documents
are both within the corpus, this is known as intra-corpal plagiarism, or occasionally as
collusion. Where the copy is inside the corpus and the source outside it (for instance
in a textbook, in a submission from a student who took the assessment in a previous
session, or on the Web), this is known as extra-corpal plagiarism.
This approach can be further divided into two categories: one operates locally on the
client computer, analysing local databases of documents or performing Internet searches;
the other is server-based, where the user uploads the document and the detection
processes take place remotely. The most commonly used techniques in current document
source comparison involve word stemming or fingerprinting. The core fingerprinting idea
has been modified and enhanced by various researchers to improve similarity detection.
Many current commercial plagiarism detection service providers claim to have proprietary
fingerprinting and comparison mechanisms.




   Figure 5.1.1: A generic structure of document source comparison based plagiarism
                                    detection system.

In order to compare two or more documents and reason about the degree of similarity
between them, a numeric value, the so-called similarity score, must be assigned. This
score can be based on different metrics; there are many parameters and aspects of a
document that can be used as metrics.


5.1.1 CLASSIFICATIONS OF METRICS

I) NUMBER OF SUBMISSIONS PROCESSED BY THE METRICS USED

The first classification is based on the number of documents involved in calculating a
metric. The methods that detection engines use to find similarity are of the most
academic interest, and it is suggested that metrics can be differentiated from one
another by the number of documents that are processed together to generate them, a set
of classifications not previously found in the literature. Studying technical
descriptions of detection engines based on the documents they process, two main types of
metrics have been identified; it is proposed to name these singular metrics and paired
metrics. These operate on one and two documents at a time, respectively, to generate a
numeric value. Paired metrics are intended to give more information than could be
gleaned by simply computing and combining two singular metrics.
For completeness, corpal metrics and multi-dimensional metrics are also defined here,
each of which operates simultaneously on a greater number of documents. A corpal metric
operates on an entire corpus at a time, for instance to find some general property of
it; one use of this might be to compare the standard of work from different tutor
groups. A multi-dimensional metric operates on a chosen number of submissions, so a
singular metric would be 1-dimensional, a paired metric 2-dimensional, and a corpal
metric n-dimensional, where n is the size of the corpus. Multi-dimensional metrics might
be useful for finding clusters of similar submissions.

                  Source Code                                  Free Text

Singular          Mean number of characters per line.         Mean number of words per sentence.
Metrics           Proportion of 'while' loops to 'for'        Proportion of use of 'their'
                  loops.                                      compared to 'there'.

Paired            Number of keywords common to two            Number of capitalised words
Metrics           source code submissions.                    common to two free text submissions.
                  The length of the longest tokenisation      The length of the longest
                  substring common to both.                   substring common to both.

Multi-            The proportion of keywords common           The proportion of words from a
Dimensional       to a set of submissions.                    chosen group common to a set
Metrics                                                       of submissions.

Corpal            The proportion of source code               The proportion of submissions
Metrics           submissions using the keyword 'while'.      using the word 'hence'.

                       TABLE 5.1 - Examples of Dimensional Metrics






Table 5.1 contains some examples of possible metrics that fall under each classification.
Examples are given for both source code and free text, although some might prove to be
inappropriate for detection.
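To make the classification concrete, the two simplest kinds of metric from Table 5.1 can be sketched in a few lines of Python. The code is illustrative only and is not taken from any of the tools discussed later; the example documents are invented.

```python
import re

def mean_words_per_sentence(text):
    # Singular metric: computed from one document at a time.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def common_capitalised_words(doc_a, doc_b):
    # Paired metric: computed from two documents at a time.
    def caps(text):
        return set(re.findall(r"\b[A-Z][a-z]+\b", text))
    return len(caps(doc_a) & caps(doc_b))

doc1 = "Plagiarism is common. Turnitin and JPlag detect it."
doc2 = "JPlag compares programs. Turnitin checks essays."
print(mean_words_per_sentence(doc1))         # 4.0
print(common_capitalised_words(doc1, doc2))  # 1 ('Turnitin')
```

Neither metric alone proves anything; as the table suggests, such values are raw material that a detection engine combines and compares across submissions.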



II) COMPLEXITY OF THE METRICS USED

The second classification is based on the computational complexity of the methods
employed to find similarities. These groups have been named superficial metrics and
structural metrics. A superficial metric is a measure of similarity that can be gauged
simply by looking at one or more student submissions; no knowledge of the structure of a
programming language or of the linguistic features of natural language is necessary. A
structural metric is a measure of similarity that requires knowledge of the structure of
one or more documents. For source code submissions this might involve a parse of the
submissions; for free text submissions it could involve reducing words to their
linguistic root form.

                   Source Code                              Free Text

Superficial        The count of the reserved keyword        The number of runs of five words
Metrics            'while'.                                 common to two submissions.

Structural         The number of operational paths          The size of the parse tree for a
Metrics            through a program.                       submission.


                  TABLE 5.2 - Examples of Operational Complexity Metrics


Although these categories are intended to be fully inclusive and mutually exclusive, it
is impossible to give a definition that can be consistently applied in every case. The
borderline between where a superficial metric stops and a structural metric begins is
necessarily fuzzy. For instance, if a submission is tokenized and a superficial metric
then applied, the whole process could instead be thought of as a single structural
metric, since tokenization is a structure-dependent process. Hence, in some cases these
definitions are open to individual interpretation.




Most intra-corpal plagiarism detection engines work by comparing every submission with
every other submission, giving time complexity proportional to the square of the number
of submissions (O(n^2) for n submissions). This means that processing time grows
quadratically as the number of submissions grows. More computationally efficient
comparison methods may take less time, an issue which is important when considering
scalability, for instance when a free-text engine is being linked to a sizeable database
of possible sources.
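The quadratic growth can be seen directly in a short sketch: for n submissions, all n(n-1)/2 pairs must be scored. The similarity function below is a deliberately naive placeholder (word-set overlap), not the scoring method of any real engine.

```python
from itertools import combinations

def similarity(a, b):
    # Naive placeholder score: overlap of word sets (Jaccard index).
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

submissions = ["the cat sat", "the dog sat", "a bird flew", "the cat ran"]

# Every submission is compared with every other one: n(n-1)/2 pairs.
pairs = list(combinations(range(len(submissions)), 2))
print(len(pairs))  # 6 pairs for n = 4

scores = {(i, j): similarity(submissions[i], submissions[j])
          for i, j in pairs}
```

Doubling the number of submissions roughly quadruples the number of pairs, which is why scalable engines look for ways to avoid exhaustive pairwise comparison.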

Another way to classify metrics is by the main principle built into them, i.e. whether
the analysis of a document's contents is based on semantic or on statistical methods.
Statistical methods have no need to understand the meaning of the document. A common
statistical approach is the construction of document vectors from values describing the
document, such as word frequencies, compression metrics, Lancaster word pairs and other
metrics. Statistical metrics can be language-independent or language-sensitive. A purely
statistical method is the N-gram approach, where text is characterized by sequences of N
consecutive characters. Based on such statistical measures, each document can be
described with so-called fingerprints, where n-grams are hashed and some of the hashes
are selected as fingerprints. There can also be measures based on probabilities.


In many cases the similarity score between two documents is calculated as the Euclidean
distance between their document vectors; the distance between identical documents is
zero. Similarity can also be calculated as the scalar product of the document vectors
divided by their lengths, which is equivalent to the cosine of the angle between the two
document vectors as seen from the origin. In many cases document vectors are composed
from word frequencies and word weights, which are automatically calculated for each
document; word frequency is taken into account in the proportion function. One more
classification is into symmetric and asymmetric similarity measures. Asymmetric
similarity measures include the heavy frequency vector and the heavy inclusion
proportion model, which are derived from the cosine function and the proportion function
by combining the asymmetric similarity concept with the heavy frequency vector.
Asymmetric similarity measures can be used to search for copying of subsets. Usually,
statistical methods are the ones implemented in tools, due to their simplicity.
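The cosine measure mentioned above follows directly from its definition. Here is a sketch using plain word counts as vector components; real systems would also apply word weights.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    # Build word-frequency vectors; their scalar product divided by the
    # product of their lengths is the cosine of the angle between them.
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical documents score 1.0 and documents with no words in common score 0.0, which makes the measure convenient as a normalised similarity score.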







One of the best-known methods for string comparison is Running Karp-Rabin Matching and
Greedy String Tiling (RKR-GST). The algorithm was described by Wise as a method for
comparing amino acid biosequences. Despite its origins in biology, the method has
application in plagiarism detection, and the RKR-GST algorithm appears to be the
principal method used in most commercial plagiarism detection software. The algorithm
attempts to detect the longest possible strings common to both documents. With RKR-GST
it is not necessary for the matched strings to be contiguous. This is a powerful
concept, because it means that matches can be detected even if some of the text has been
deleted, or if additional text has been inserted. It is possible for RKR-GST to detect
matches even when portions of multiple documents have been combined to create a
patchwork of plagiarized material. The algorithm can be further enhanced by parsing the
documents to remove trivial words and tokens.
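A heavily simplified sketch of the greedy string tiling part of the idea is given below, without the Karp-Rabin hashing that makes RKR-GST efficient: repeatedly find the longest common run of unmarked tokens in both sequences and mark it as a tile. The token lists and the minimum match length are illustrative choices.

```python
def greedy_string_tiling(a, b, min_match=3):
    # a and b are token lists; returns tiles as (start_a, start_b, length).
    # Plain greedy version; RKR-GST adds Karp-Rabin hashing for speed.
    marked_a, marked_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = None
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k >= min_match and (best is None or k > best[2]):
                    best = (i, j, k)
        if best is None:
            break
        i, j, k = best
        for t in range(k):  # mark the tile so it cannot be matched again
            marked_a[i + t] = marked_b[j + t] = True
        tiles.append(best)
    return tiles
```

Because tiles are matched independently of their position, reordered passages are still found: swapping the two halves of a sentence simply yields two tiles instead of one, which is exactly the insensitivity to insertion and rearrangement described above.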



5.2 MANUAL SEARCH OF CHARACTERISTIC PHRASES


Using this approach the instructor or examiner selects some phrases or sentences
representing core concepts of a paper. These phrases are then searched across the internet
using single or multiple search engines. Let us explain this by means of an example.
Suppose we detect the following sentence in a student’s essay
“Let us call them eAssistants. They will be not much bigger than a credit card, with a fast
processor, gigabytes of internal memory, a combination of mobile-phone, computer,
camera”
Since eAssistant is an uncommon term, it makes sense to input the term into a Google
query. Indeed, if this is done, the query produces:
"(Maurer H., Oliver R.) The Future of PCs and Implications on Society - Let us call them
eAssistants. They will be not much bigger than a credit card, with a fast processor,
gigabytes of internal memory, a combination of...
www.jucs.org/jucs_9_4/the_future_of_pcs/Maurer_H_2.html - 34k -"






This shows, without further tools, that the student has used part of a paper published
in the Journal of Universal Computer Science. It is clear that this approach is
labor-intensive; hence it is obvious that some automation will make sense.


Another method is the cataloging of past student papers. Some institutions have
maintained vaults of student composition papers cross-indexed in several ways. A teacher
who suspected plagiarism could descend into the vault (or more likely, send a teaching
assistant) to search for a paper submitted during a previous semester. Many faculty
members detect plagiarism by observing writing styles. Sometimes a paper seems to be
too professionally written to have been prepared by a student. Another clue is a sudden
shift in writing styles. Ironically, the Copy/Paste process that makes plagiarism so easy
also betrays the crime because students forget to reformat the text into a uniform font.


All of the manual methods have serious deficiencies. It is impossible to know all of the
literature on most topics that undergraduate students are likely to be writing about. The
problem is exacerbated by the growing number of informal, unpublished papers available
over the Internet.



5.3 QUIZ METHOD

The Glatt Plagiarism Screening System involves the quiz method. The program removes
words from a student’s paper and asks the student to replace the missing words. A score
is generated based on the accuracy of the student responses and the amount of time it
takes for students to complete the task.
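The quiz generation step can be sketched as follows. The interval of every fifth word follows the description of the Glatt system in section 6.3; the exact scoring rule here is an illustrative assumption.

```python
def make_cloze(text, interval=5):
    # Blank out every `interval`-th word, remembering the removed words
    # so the writer's answers can be scored afterwards.
    words = text.split()
    answers = {}
    for i in range(interval - 1, len(words), interval):
        answers[i] = words[i]
        words[i] = "_____"
    return " ".join(words), answers

def score(answers, responses):
    # Fraction of blanks filled with exactly the removed word.
    correct = sum(1 for i, w in answers.items()
                  if responses.get(i, "").lower() == w.lower())
    return correct / max(len(answers), 1)
```

A genuine author, familiar with their own phrasing, should restore most of the removed words quickly; a low score or a long answering time raises suspicion.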


Another method of detecting plagiarism is quizzing students about their written work. A
student who has produced their own paper should be familiar with its contents and should
be able to answer questions about it. Effective questioning can involve asking students
about ideas that were left out or rejected.








6. AVAILABLE TOOLS

Several applications and services exist to help academia detect intellectual dishonesty.
I have selected some of these tools that are currently particularly popular and describe
their main features in what follows.


6.1 ATTRIBUTES OF DETECTION TOOLS

According to analytical information available on the Web, the leader among detection
tools is Turnitin, due to its functionality. Each tool has a set of attributes that
determine its application. Two main attributes common to all tools are:


1) The type of text the tool operates on;
2) The type of corpus the tool operates on.


According to the attribute "type of text the tool operates on", tools can be divided
into two groups: tools that operate on unstructured (free) text and tools that operate
on structured (source code) text. In fact, detection techniques are not limited to free
text or source code; they may be used to find similarity in spreadsheets, diagrams,
scientific experiments, music or any other non-textual corpora.
According to the attribute "type of corpus the tool operates on", tools can be divided
into three groups: tools that operate only intra-corpally (where the source and copy
documents are both within a corpus), tools that operate only extra-corpally (where the
copy is inside the corpus and the source outside), and tools that operate both intra-
and extra-corpally.


6.2 Turnitin


This is a product from iParadigms [iParadigm 2006]. It is a web-based service; a team of
researchers at UC Berkeley developed the original computer programs in 1996 to monitor
plagiarism in undergraduate classes. A "digital portfolio" service that will provide
storage and retrieval of academic documents is an upcoming feature.







Detection and processing are done remotely. The user uploads the suspected document to
the system database. The system creates a complete fingerprint of the document and
stores it. Proprietary algorithms are then used to query three main sources: a current
and extensively indexed archive of the Internet with approximately 4.5 billion pages;
books and journals in the ProQuest™ database; and 10 million documents already submitted
to the Turnitin database.




                                         Fig: 6.1.1


Turnitin offers different account types: consortium, institute, department and
individual instructor. Each account type can create the account types listed after it
and has management capabilities over them. At the instructor account level, teachers can
create classes and generate class enrolment passwords. Such passwords are distributed
among students for joining the class and submitting assignments. Figures 6.1.1 and 6.1.2
give an idea of the system's user interface.

The system generates the originality report within minutes of submission. The report
contains all the matches detected and links to the original sources, with colour codes
describing the intensity of plagiarism. It is, however, not a final statement of
plagiarism: a high percentage of similarity does not necessarily mean an actual case of
plagiarism; one has to interpret each identified match to decide whether it is a false
alarm or actually needs attention.




                  Figure 6.1.2: Turnitin, originality report of a submission


6.3 Glatt

Glatt Plagiarism Services, founded in 1987 by Dr. Barbara S. Glatt, uses Wilson Taylor's
(1953) cloze procedure. This method is based on writing styles and patterns: every fifth
word in a suspected document is removed, and the writer is asked to fill in the missing
spaces. The number of correct responses and the answering time are used to calculate a
plagiarism probability. For example, suppose the examiner suspects that the following
paragraph is plagiarized.





“The proposed framework is a very effective approach to deal with information available
to any individual. It provides precise and selected news and information with a very high
degree of convenience due to its capabilities of natural interactions with users. The
proposed user modelling and information domain ontology offers a very useful tool for
browsing the information repository, keeping the private and public aspects of
information retrieval separate. Work is underway to develop and integrate seed resource
knowledge structures forming basis of news ontology and
user models using.....”
The writer is asked to take a test and fill in periodic blank spaces in text to verify the
claim of authorship. A sample test based on above paragraph is shown in figure 6.3.1.




                    Figure 6.3.1: Glatt Plagiarism Self-Detection Program








                         Fig 6.3.2: Glatt Plagiarism Detection Results

The percentage of correct answers can be used to determine whether the writing is from
the same person or not. The result of the mentioned test is shown in figure 6.3.2. This
approach is not always feasible in an academic environment where large numbers of
documents need to be processed, but it provides a very effective secondary layer of
detection to confirm and verify results.



6.4 JPlag

JPlag is an Internet-based service used to detect plagiarism among a set of programs. It
finds "similarities among multiple sets of source code files." Created by Guido Malpohl,
JPlag currently supports Java, C, C++, Scheme, and also natural language text. Users
upload the files to be compared and the system presents a report identifying matches.






JPlag takes as input a set of programs, compares these programs pairwise (computing for each pair a total similarity value and a set of similarity regions), and provides as output a set of HTML pages that allow the similarities found to be explored and understood in detail. JPlag works by converting each program into a stream of canonical tokens and then trying to cover one such token string with substrings taken from the other (string tiling). Because the analysis is aware of programming-language syntax and structure, it is not misled by superficial edits such as renaming identifiers.
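The token-and-tiling idea can be illustrated with a small Python sketch. The tokenizer and the quadratic tiling loop below are deliberately naive stand-ins for JPlag's language-aware front ends and its optimized Greedy String Tiling; only the overall scheme is faithful to the description above.

```python
# Toy illustration of JPlag's scheme: reduce each program to canonical
# tokens, then cover one token stream with maximal matching substrings
# of the other (greedy string tiling).

import re

def tokenize(source):
    """Map source code to canonical tokens, ignoring names and literals."""
    keywords = {"if", "else", "while", "for", "return", "def"}
    tokens = []
    for word in re.findall(r"[A-Za-z_]\w*|[-+*/=<>(){};]", source):
        if word in keywords:
            tokens.append(word.upper())
        elif word[0].isalpha() or word[0] == "_":
            tokens.append("ID")      # all identifiers look alike
        else:
            tokens.append(word)      # operators/punctuation kept verbatim
    return tokens

def greedy_string_tiling(a, b, min_match=3):
    """Total length of non-overlapping common substrings >= min_match."""
    marked_a, marked_b = set(), set()
    total = 0
    while True:
        best, best_pair = 0, None
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and i + k not in marked_a and j + k not in marked_b):
                    k += 1
                if k > best:
                    best, best_pair = k, (i, j)
        if best < min_match:
            break
        i, j = best_pair
        for k in range(best):        # mark the tile so it is not reused
            marked_a.add(i + k)
            marked_b.add(j + k)
        total += best
    return total

def similarity(a, b):
    """Similarity as shared token coverage over average stream length."""
    ta, tb = tokenize(a), tokenize(b)
    return 2 * greedy_string_tiling(ta, tb) / (len(ta) + len(tb))
```

Because identifiers are canonicalized, a copy in which every variable has been renamed still produces an identical token stream and hence a similarity of 1.0.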



6.5 WCopyfind

Developed by Lou Bloomfield, Professor of Physics, University of Virginia, this program examines a group of documents that an instructor selects, and pulls out text portions of those documents with matching words in phrases of a specified minimum length. The program cannot find such "shared" phrases in documents that are "external", i.e. not entered for testing. Recent versions of the software can also handle web documents.
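The core matching step can be sketched as a word n-gram intersection in Python. This is a simplified stand-in for WCopyfind's matcher: the real tool can also bridge small imperfections, while the sketch reports only exact shared phrases of the chosen minimum length (the parameter name merely echoes the tool's "Shortest Phrase to Match" setting).

```python
# Sketch of WCopyfind-style matching: collect every phrase of
# `shortest_phrase` consecutive words from each document and report
# the phrases both documents share. Only exact word strings are
# compared, so the documents' language is irrelevant.

def words(text):
    return text.lower().split()

def shared_phrases(doc_a, doc_b, shortest_phrase=6):
    """Set of word n-grams of length `shortest_phrase` found in both."""
    def ngrams(ws, n):
        return {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}
    return (ngrams(words(doc_a), shortest_phrase)
            & ngrams(words(doc_b), shortest_phrase))
```

As the surrounding text notes, such a matcher only sees the documents it is given; a phrase copied from an unsubmitted Web source goes undetected unless it is separately searched for online.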


Since WCopyfind works at the level of raw text strings, the language of the documents is unimportant, and matches are readily identified among the candidate documents submitted for analysis. Note that such a procedure cannot find plagiarism based on documents that were not submitted, for example documents resident on the Web. Of course, a small subset of suspect text can be submitted for Web-based comparison, for example with Google. In one such case, a sample of the identified within-cohort plagiarized text was submitted to a Google search and immediately revealed a Web source containing the same text.

Figure 6.5.1, Figure 6.5.2, and Figure 6.5.3 show the WCopyfind system interface, a WCopyfind report, and a WCopyfind document comparison, respectively.








                         Figure 6.5.1: WCopyfind – system interface




                                  Figure 6.5.2: WCopyfind – report







                      Figure 6.5.3: WCopyfind – document comparison

The comparison window and “matches.txt” both list two match counts. The first, “Total Match”, is the number of perfectly matching words that have been marked in the pair of documents. The second, “Basic Match”, is the number of perfectly matching words in phrases of at least “Shortest Phrase to Match” words; it is essentially the value that would have been obtained if no imperfections were allowed in the matching. In fact, if the “Most Imperfections to Allow” parameter is set to zero, “Total Match” and “Basic Match” are the same. In the reports, perfect matches are indicated by red-underlined words and bridging, while non-matching words are indicated by green-underlined words and bridging.
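The distinction between the two counts can be made concrete with a toy Python sketch. It compares two word sequences position by position, which is a simplification (the real tool finds matching phrases anywhere in either document), and the parameter names merely echo the tool's settings.

```python
# Toy illustration of "Basic Match" versus "Total Match". Basic Match
# counts matching words only in perfect runs of at least `shortest`
# words; Total Match also counts runs joined by bridging up to
# `imperfections` mismatched words. With imperfections=0 the two
# counts coincide, as stated in the text.

def runs_of_matches(a, b):
    """(start, length) of each maximal run of exact word matches."""
    runs, start = [], None
    for i in range(min(len(a), len(b)) + 1):
        match = i < min(len(a), len(b)) and a[i] == b[i]
        if match and start is None:
            start = i
        if not match and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def basic_match(a, b, shortest=3):
    return sum(n for _, n in runs_of_matches(a, b) if n >= shortest)

def total_match(a, b, shortest=3, imperfections=1):
    matched, merged = runs_of_matches(a, b), []
    for s, n in matched:
        if merged and s - (merged[-1][0] + merged[-1][1]) <= imperfections:
            ps, pn = merged.pop()
            merged.append((ps, s + n - ps))      # bridge the small gap
        else:
            merged.append((s, n))
    # count only the words that actually match inside qualifying spans
    total = 0
    for s, n in merged:
        if n >= shortest:
            total += sum(rn for rs, rn in matched if s <= rs < s + n)
    return total
```

For example, comparing "the cat sat on the mat and purred" with "the cat sat on a mat and purred" (one substituted word) with a shortest phrase of 4, Basic Match counts only the opening four-word run, while allowing one imperfection bridges the gap and Total Match counts all seven matching words.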
Plagiarism detection tools operate using statistical methods, semantic methods, or both combined for better results. Since statistical methods are easier to implement in software, most detection tools use statistical methods to detect plagiarism.






7. LIMITATIONS OF DETECTION TOOLS

The drawbacks of detection tools are:
    •   Inability to distinguish correctly cited text from plagiarized text.
    •   Inability to search books, which these services typically do not index.
    •   Detection of plagiarized words only, not plagiarized thoughts or ideas.
    •   Inability to process textual images for similarity checks.


Analysis of the known plagiarism detection tools shows that although these tools provide excellent service in detecting matching text between documents, even advanced plagiarism detection software cannot detect plagiarism as well as a human can. The inability of plagiarism detection tools to distinguish correctly cited text from plagiarized text is one of their most serious drawbacks. That is why human intervention is necessary before a paper is declared plagiarized; manual checking and human judgment are still needed.



8. CONCLUSION

In the age of information technologies, plagiarism has become more widespread and has turned into a serious problem. This paper discusses ways to reduce plagiarism. Prevention methods based on changing society's attitude toward plagiarism are without doubt the most significant means of fighting it, but implementing these methods is a challenge for society as a whole. The human brain is a universal plagiarism detection tool: it can analyze a document using both statistical and semantic methods, and it can operate on textual as well as non-textual information. At present such abilities are not available to plagiarism detection software. Nevertheless, computer-based plagiarism detection tools can considerably help to find plagiarized documents.








9. REFERENCES

1) Romans Lukashenko, Vita Graudina, Janis Grundspenkis. Computer-Based Plagiarism Detection Methods and Tools: An Overview. Proceedings of the 2007 International Conference on Computer Systems and Technologies, Vol. 285.
2) Lancaster T., Culwin F. Classifications of Plagiarism Detection Engines. ITALICS, Vol. 4(2), 2005.
3) Maurer H., Kappe F., Zaka B. Plagiarism – A Survey. Journal of Universal Computer Science, Vol. 12, No. 8, pp. 1050–1084, 2006.
4) www.plagiarism.org
5) http://www.educause.edu/ir/library/pdf/ser07017b.pdf
6) https://www.ipd.uni-karlsruhe.de/jplag
7) http://plagiarism.phys.virginia.edu/WCopyfind%202.5.exe




