

Filter a file


Filter a File lets you strip unwanted text or markup out of a file.

It lets you use Concordance's text processing power to make a new version of your text file instead of making a concordance.

You could use it, for example, to strip all the HTML (web) markup out of a file, leaving only plain text; or to strip out references, comments, and other markup which you may have added for use with Concordance.

It can also help you see how Concordance is treating your text if you are having trouble choosing the correct options.

Any warnings or other messages produced while filtering the file are shown in the Progress dialog.  Once the filtered file has been created, it is displayed in the File Viewer.

Example: Stripping HTML 

To remove all HTML (web) markup from a file, leaving only the text content, go to the Text menu, choose Ignore, and for Skip Markers 1, enter < as the opening marker and > as the closing marker.  Then choose Filter a File.  

In this case, the same result could be achieved by opening the References dialog on the Text menu, defining  <  as the opening reference marker and  >  as the closing reference marker, then choosing Filter a File.  This is because Filter a File, like Concordance's usual text processing, strips out anything between reference markers.
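The effect of stripping everything between an opening and a closing marker can be sketched in Python. This is only an illustration of the idea, not Concordance's own code: the function name strip_between is hypothetical, and leaving an unclosed opening marker untouched is an assumption, since the behaviour in that case is not described here.

```python
def strip_between(text: str, open_marker: str = "<", close_marker: str = ">") -> str:
    """Remove every span from open_marker through the next close_marker."""
    kept = []
    pos = 0
    while True:
        start = text.find(open_marker, pos)
        if start == -1:
            # No more opening markers: keep the remainder as it is.
            kept.append(text[pos:])
            break
        end = text.find(close_marker, start + len(open_marker))
        if end == -1:
            # Unclosed marker: keep the rest unchanged (an assumption).
            kept.append(text[pos:])
            break
        kept.append(text[pos:start])
        pos = end + len(close_marker)
    return "".join(kept)


print(strip_between("<p>Hello, <b>world</b>!</p>"))  # Hello, world!
```

The same function works for any other pair of markers, for example strip_between("x[ref]y", "[", "]").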

To remove HTML entities (special characters) as well as HTML markup between angle brackets, tick Convert HTML entities found in input in the Preferences dialog before filtering your file.
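As a rough illustration of what converting HTML entities means, the sketch below uses html.unescape from Python's standard library. Note the assumption: Python's function recognises the full set of HTML named and numeric entities, whereas Concordance translates only the entities defined in your HTMLtoLatin1.ini file, so the two will not always agree. The function name convert_entities is hypothetical.

```python
import html


def convert_entities(line: str) -> str:
    """Replace HTML entities such as &eacute; with the characters they stand for."""
    return html.unescape(line)


print(convert_entities("caf&eacute; &amp; cr&egrave;me"))  # café & crème
```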

Details

Here is what Filter a File does, in detail.

It reads your input file and writes a new output file (named automatically by adding the extension .Filtered to the name of your input file), performing the following actions:

1. Warns of long lines in the input 

2. Translates HTML entities to ANSI characters, where those are defined in your file HTMLtoLatin1.ini, if you have ticked Convert HTML entities found in input in the Preferences dialog.  (Step 9 below, if selected, will undo this, so there is no point in selecting both when filtering a file.)

3. Removes any text you have defined in the Ignore dialog on the Text menu (that is, text between skip markers or text from selected positions on the line, or both) 

4. Removes any control characters found in the file 

5. Trims any leading and trailing spaces in each line 

6. Removes characters not defined in your Alphabet, unless you have chosen the option in the Alphabet dialog which causes such characters to be added automatically to your alphabet. For more information, see the Alphabet dialog on the Text menu.

7. Translates OEM (DOS) characters to ANSI (Windows) characters if you have selected that option in the Alphabet dialog on the Text menu

8. Removes any text found between reference opening markers and reference closing markers, if any are defined in the References dialog on the Text menu.

9. Translates ANSI characters to HTML entities, where those are defined in your file Latin1toHTML.ini, if you have ticked Convert to HTML entities during output in the Preferences dialog.  (Step 2 above, if selected, will do the reverse of this, so there is no point in selecting both when filtering a file.)
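The order of the steps above can be sketched as a small Python pipeline. This is a simplified illustration under several assumptions, not a description of how Concordance is implemented: MAX_LINE and KEEP are hypothetical stand-ins for the long-line limit and the Alphabet dialog settings, html.unescape stands in for HTMLtoLatin1.ini, the output name is assumed to be the full input name with .Filtered appended, and steps 7 and 9 (OEM to ANSI, and conversion to HTML entities on output) are left out.

```python
import html
import string
import sys

MAX_LINE = 1000  # hypothetical limit; the text does not say what counts as "long"
# Hypothetical stand-in for the characters allowed by the Alphabet dialog.
KEEP = set(string.ascii_letters + string.digits + string.punctuation + " ")


def remove_between(line, opener, closer):
    """Drop every span from opener through the next closer."""
    while True:
        start = line.find(opener)
        end = line.find(closer, start + len(opener)) if start != -1 else -1
        if end == -1:
            return line
        line = line[:start] + line[end + len(closer):]


def filter_line(line, skip=("<", ">"), refs=None, convert_entities=False):
    if len(line) > MAX_LINE:                      # step 1: warn of long lines
        print("warning: long line", file=sys.stderr)
    if convert_entities:                          # step 2: entities to characters
        line = html.unescape(line)
    line = remove_between(line, *skip)            # step 3: text between skip markers
    line = "".join(c for c in line
                   if ord(c) >= 32 and ord(c) != 127)  # step 4: control characters
    line = line.strip(" ")                        # step 5: leading/trailing spaces
    line = "".join(c for c in line if c in KEEP)  # step 6: characters not in alphabet
    if refs:                                      # step 8: text between reference markers
        line = remove_between(line, *refs)
    return line


def filter_file(path):
    out_path = path + ".Filtered"  # assumption: extension appended to the full name
    with open(path, encoding="latin-1") as src, \
            open(out_path, "w", encoding="latin-1") as dst:
        for raw in src:
            dst.write(filter_line(raw.rstrip("\r\n")) + "\n")
    return out_path


print(filter_line("  <p>Tom &amp; Jerry\x07</p>  ", convert_entities=True))  # Tom & Jerry
```

Running the steps in a fixed order matters: in this sketch, for instance, entities are converted before skip markers are applied, so an entity that decodes to < would then be treated as an opening marker.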