How to proofread a book for Wikisource.

Proofreading is the foundation of Wikisource, providing the best quality texts in our library. The process involves two "namespaces" (sections of Wikisource, shown at the start of the page title) and a special piece of software. Together, these two namespaces (Index and Page) are sometimes called the "workspace". This is where the proofreading, editing and other "back room" processes are done.

The process is based on page scans of a physical book, usually in the form of a DjVu file. This is used to make an Index page, which is a page in the "Index" namespace with the same name as the DjVu file. Each individual page in the book is a separate page in the "Page" namespace. The Index page will link to the pages and each page needs to be proofread.
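As a rough illustration of how the Index page and its Page-namespace pages relate by title, here is a minimal sketch. The function names are hypothetical, and the `/N` suffix for individual pages is an assumption based on the usual Wikisource convention for DjVu scans; it is not stated in this guide.

```python
def index_title(filename: str) -> str:
    """Title of the Index page, which shares its name with the scan file."""
    return f"Index:{filename}"


def page_title(filename: str, position: int) -> str:
    """Title of one page of the scan in the Page namespace.

    Assumption: pages are addressed by their position in the scan file,
    starting at 1, appended after a slash.
    """
    if position < 1:
        raise ValueError("position in the scan starts at 1")
    return f"Page:{filename}/{position}"


if __name__ == "__main__":
    # "Example.djvu" is a made-up file name used only for illustration.
    print(index_title("Example.djvu"))
    print(page_title("Example.djvu", 14))
```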

The following guide will explain how to proofread a page, with pointers to other pages with more detailed information. For a guide to the Index page portion of proofreading, see Help:Beginner's guide to Index: files.

How to proofread a page

Note: To get an idea about how this process works, it is a good idea to try a few pages of the current Proofread of the Month.
 
Index page, showing different status colors for each page

Proofreading is based around the Index page and all of the connected Page-namespace pages. The first step is to find one page to proofread. You will probably be starting from an "index" page. The index page shows a photo of the cover or first page. Below the picture is a list of all the page numbers.

  1. Choose a page to edit. You do not have to choose the first page or the next one in the list. The colors of the page numbers provide information:
    • Green background, such as  14 , means "validated" (finished). If there are any of these, you can look at them to get an idea of what the ultimate goal is.
    • Yellow background, such as  14 , means "proofread" (first person is done, awaiting a second person to check, or "validate" it).
    • Red background, such as  14 , means the page has been created but not fully proofread.
    • Blue background, such as  14 , means the page needs help from an expert volunteer due to complex formatting, missing images, etc. ("problematic").
    • Gray background, such as  14 , means that the page does not need proofreading because it contains no text.
    • Red number on a white background, such as  14 , means that the page has not been created yet. If you are new to proofreading, you probably want to choose one of these pages. These red links will automatically open the editing tools. Click on several pages, until you find a page that appeals to you.
  2. If you click on any of the numbers on an Index page, you will see an image of that page side-by-side with a text field. The text field may be blank or it might have been automatically filled with the text of that page.
    • If it is blank: write the text you see in the image into the text field.
    • If it is not blank: correct the text in the text field so that it matches the text in the image.
  3. Preview your work, set the status to "Proofread" (which is yellow), then save. See Help:Proofreading and Help:Page status for more information.
    • If you have not finished proofreading the page but you want to save it, set the status to "Not proofread" (which is red).
  4. Repeat the last two steps for every page in the scan.
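The color legend in step 1 can be summarised as a small lookup table. This is only a sketch for reference: the names `STATUS_BY_COLOR`, `describe` and `needs_work` are hypothetical, and "Not created" is a label chosen here for the red-number-on-white case, not an official status name.

```python
# Background color of a page number on the Index page -> page status.
STATUS_BY_COLOR = {
    "green": "Validated",
    "yellow": "Proofread",
    "red": "Not proofread",
    "blue": "Problematic",
    "gray": "No text",
    "white": "Not created",  # red number on a white background
}

# Statuses where an ordinary proofreader can still move the page forward.
WORK_REMAINING = {"Not created", "Not proofread", "Proofread"}


def describe(color: str) -> str:
    """Return the page status that a background color stands for."""
    return STATUS_BY_COLOR[color.lower()]


def needs_work(color: str) -> bool:
    """True if a page with this color still needs proofreading or validation."""
    return describe(color) in WORK_REMAINING


if __name__ == "__main__":
    for color in STATUS_BY_COLOR:
        print(color, "->", describe(color), "| needs work:", needs_work(color))
```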

The side by side layout

 
(Fig 1) Side-by-side layout in Page namespace

When you view a page in the Page namespace, the screen is split into two sections (fig 1). This default side-by-side layout lets you proofread the text on Wikisource (left section) against the scanned text (right section). When you click Show Preview on a page in the Page namespace, the screen has three sections (fig 2): the text edit window and the scanned text remain as they are, and the previewed text appears in an area above them.

To proofread a page, you should edit the text in the left section so that it matches the scan in the right section as much as possible.

You do not have to make an identical, photographic copy of the scan. Wikisource is a website, not a book, and the text is more important than the typography. You should just try to get as close as possible. Some things work in books but do not work on Wikisource. For example, if the text was originally in columns (like a newspaper), then preserving that formatting is not necessary and does not work well on Wikisource, because several pages will be added together in the main namespace when proofreading is finished. Instead, use normal paragraphs without columns, placed in the order that you would naturally read the page.

 
(Fig 3) Page status buttons

When you save the page, you should also set the page status. You should see a row of color-coded radio buttons just above the save button (fig 3). If you have just started a page with no (or not many) changes, then select the red button (for "Not proofread"). If you have completely proofread the page and corrected every error you can find, then select the yellow button (for "Proofread").

Some pages will have been proofread already by other people. You can check these and upgrade the page status. Look through the page for any remaining errors or things that need to be changed. If there are no errors, or you have fixed everything that needs to be fixed, increase the page status by one level. "Not proofread" (red) pages become "Proofread" (yellow), which become "Validated" (green). Validated pages are finished and should not need any more editing. Blank pages (gray) and Problematic pages (blue) are special cases; see below for more information.
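The "increase the page status by one level" rule can be sketched as follows. The function and parameter names are hypothetical. The check that the validator differs from the proofreader is an assumption drawn from the description of "Proofread" pages as awaiting a second person; this sketch is not how the Wikisource software itself is implemented.

```python
from typing import Optional

# One-level promotions described in the text above.
NEXT_STATUS = {
    "Not proofread": "Proofread",
    "Proofread": "Validated",
}


def promote(status: str, checker: str, proofread_by: Optional[str] = None) -> str:
    """Return the status one level up, or the same status if it cannot be raised.

    "Validated", "No text" and "Problematic" are left unchanged.
    """
    if status not in NEXT_STATUS:
        return status
    # Assumption: validation is done by someone other than the proofreader.
    if status == "Proofread" and checker == proofread_by:
        return status
    return NEXT_STATUS[status]


if __name__ == "__main__":
    print(promote("Not proofread", checker="alice"))
    print(promote("Proofread", checker="bob", proofread_by="alice"))
```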

Blank pages can be left blank and set to the "No text" (gray) page status. These pages will be ignored when pages are added to the main namespace.

This includes book covers, unless illustrated. This does not include pages with an illustration, which should be proofread as normal. If the illustration is unavailable at present, see Problematic pages.

If you have a problem while proofreading a page and cannot finish it, you can set the page status to "Problematic" (blue). This will alert other people that a problem exists, which they may be able to solve.

Common problems include pages with illustrations (if no image file is available), pages with equations, pages with foreign text (especially text that does not use the Latin alphabet) and pages with special formatting. In some of these cases, special templates exist to identify the problem (see Problem templates, below). These are useful to anyone else looking at the page and they can attract the attention of people able to fix the problem.

  • Text formatting, such as bold or italics: use '''bold''' or ''italics''.
  • Different text sizes: use {{smaller}} or {{larger}}.
  • Special typography, such as:
    • Dropped or raised initials
    • Capitalisation. If capital letters are printed at the same height as the lower-case text (small capitals), use {{smallcaps}}
    • Horizontal lines: {{rule}}
    • Section breaks (rows of asterisks: * * * * * )
  • Remove any marks or additions that are not part of the original book, including handwriting, ex libris bookplates, library stamps, stains, scratches, watermarks and dirt. The scanning process is imperfect and will attempt to "interpret" marks on a page. Dots, holes, and smears on the original page can end up as misspellings, extra spaces, or punctuation marks (like commas, periods, and hyphens). These can be tricky, so careful examination of the original text is necessary.
  • Columns of text (as in a newspaper) do not need to be preserved. The text should simply continue from one column to the next in reading order. (Do include tables and information that is meant to be presented in columns.)
  • Do not correct spellings. Use the template {{SIC}} instead.
  • Line breaks. Webpages will normally ignore single linebreaks, so text broken into different lines (common with scanned text) will be seen normally by a reader. Line breaks can cause problems (especially with templates, links and tables, and italics/bold which are closed by the line ending) but removing them is a matter for the individual proofreader.
For example:
Original: "Hello," said the example. This is
an example of a broken line.
Corrected: "Hello," said the example. This is an example of an unbroken line.
  • Pages that are not part of the work itself, such as adverts, do not need to be proofread or included in the main version. On the other hand, if a proofreader wants to proofread and include these pages, that is allowed.
  • Advanced typography. Creating a page that looks like the original is nice. However, the text itself is more important. Some typography can be difficult to produce and/or cause problems with the website.
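For proofreaders who choose to remove line breaks, as in the "Hello" example in the list above, the joining step can be sketched as follows. The function name `unwrap_paragraphs` is hypothetical. The sketch keeps blank lines as paragraph breaks and joins everything else with single spaces; it does not rejoin words hyphenated across lines, and it would also flatten wikitext lists or tables, so the result still needs to be checked against the scan.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join the broken lines inside each paragraph into one line.

    Paragraphs are separated by one or more blank lines, which are
    normalised to a single blank line in the result.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and surrounding whitespace on each line.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


if __name__ == "__main__":
    broken = '"Hello," said the example. This is\nan example of a broken line.'
    print(unwrap_paragraphs(broken))
```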

Optical Character Recognition (OCR) is the function used by computers to read text. This is often saved within DjVu files and is extracted by the computer when a new page is started in proofreading. However, computers are not very good at reading printed text and errors (sometimes called "scanos") can be quite frequent. This table shows some common errors made by computers that will need to be found and corrected during proofreading.

For example:
OCR error → Correction
tlie → the
a11, aH, aU → all
au → an
\vas → was
mc → me
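A list like the one above can be turned into a simple checker that points out suspicious words for a human to review. This is a minimal sketch with hypothetical names (`COMMON_SCANOS`, `find_scanos`). It only flags candidates and never changes the text, because a word that looks like an OCR error may be correct on the scan.

```python
# Common OCR errors from the table above, mapped to the likely correction.
# Raw strings are used so that the backslash in "\vas" is kept literally.
COMMON_SCANOS = {
    "tlie": "the",
    "a11": "all",
    "aH": "all",
    "aU": "all",
    "au": "an",
    r"\vas": "was",
    "mc": "me",
}

# Punctuation that may be attached to a word and should be ignored.
_EDGE_PUNCTUATION = '.,;:!?"\'()'


def find_scanos(text: str) -> list:
    """Return (suspicious word, suggested correction) pairs found in text."""
    hits = []
    for token in text.split():
        word = token.strip(_EDGE_PUNCTUATION)
        if word in COMMON_SCANOS:
            hits.append((word, COMMON_SCANOS[word]))
    return hits


if __name__ == "__main__":
    sample = r"It \vas tlie end of a11 things."
    for word, suggestion in find_scanos(sample):
        print(f"check {word!r}: possibly {suggestion!r}")
```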

Other common things to correct

  • Paragraph breaks. A blank line should be left between paragraphs, as standard for electronic and internet formatting.
  • Spaces before punctuation should usually be removed.
For example:
OCR error: foo bar ; lorem ipsum
Corrected: foo bar; lorem ipsum
The space before the semicolon has been removed.
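The punctuation fix above can be sketched with a regular expression. The function name `tighten_punctuation` and the set of punctuation marks it handles are assumptions made for illustration. Because the guidance says spaces should "usually" be removed, treat the result as a suggestion: for instance, this pattern would also close up deliberately spaced ellipses.

```python
import re

# One or more spaces or tabs directly before a punctuation mark.
_SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([;:,.!?])")


def tighten_punctuation(text: str) -> str:
    """Remove spaces that appear immediately before common punctuation."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


if __name__ == "__main__":
    print(tighten_punctuation("foo bar ; lorem ipsum"))
```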

Some templates may be needed when proofreading a page.

Proofreading templates


These should be used if there is a problem that you cannot fix yourself. When using one of these, also set the progress to "problematic" (blue).

Template: used where...
{{missing image}}: an image should be included.
{{missing table}}: a table should be included.
{{missing score}}: a musical score should be included.
{{missing math formula}}: a mathematical formula should be included.
{{illegible}}: the text cannot be read.
{{arabic missing}}: Arabic characters are used.*
{{chinese missing}}: Chinese characters are used.*
{{greek missing}}: Greek characters are used.*
{{hebrew missing}}: Hebrew characters are used.*
{{russian missing}}: Cyrillic characters are used.*
{{symbol missing}}: unknown symbols are used.
* Where you cannot read or write in these languages.