AstroAsciiData_history - SciPy wiki dump

1. Introduction

ASCII tables are one of the major data exchange formats used in science. In astronomy we use ASCII tables e.g. for object lists, line lists or even spectra.

Everyone who works with astronomical data has to deal with ASCII data, but there are various ways to do so. Some use the awk scripting language, some convert the ASCII tables to FITS tables and then work on the FITS data, and some use IDL routines. Most of these approaches require individual effort (such as preparing a format file for the conversion to FITS) whenever a new kind of ASCII table appears, e.g. one with a different number of columns.

Within the AstroAsciiData project we envision a class that can easily be used to work on all kinds of ASCII tables. The class should provide a convenient tool so that the user can easily:

- read in ASCII tables
- manipulate table elements
- save the modified ASCII table
- combine several tables
- delete/add rows and columns

The AstroAsciiData class may be used interactively, within small scripts, in data reduction tasks and even in databases.

In general, the ASCII tables used in astronomy are rather small. As an example, the Wide Field Camera catalogue of the Hubble Ultra Deep Field is 2.2 MB in size. Handling such amounts of data is not a time-consuming task for modern computers. As a consequence, computational speed is not a prime concern in the design and construction of the software. The focus is rather on maximizing convenience for the user, so that the threshold for starting to use the class is low.

2. Basic Structure

The whole project is implemented as a single module "asciidata" in the file "asciidata.py". The module contains a few functions and a couple of classes. There is a central class with the name "AsciiData".

All functions work exclusively on the central class "AsciiData". After creating an "AsciiData" instance from an ASCII table, the user works on the object through its methods. All other classes are used only within the main class "AsciiData". The user neither needs direct access to those classes nor has to work directly on their instances.

3. The ASCII Table

The primary purpose of the module is to work with ASCII tables, that is, text files with tabulated ASCII characters as content. To encourage users to adopt the asciidata module, the input ASCII tables should have to meet only minimal requirements to work properly.

Those requirements are:

- each row in the ASCII table must be formatted in an identical way
- the first characters of a column can be blank, but the last character must not be blank
- each row must contain the same sequence of data types
- there must be at least one row without any NULL entry
- every element of a column must reflect the size of the data type needed for the entire column (this can be critical for integers)

The asciidata module allows the ascii table to:

- contain blank lines, which are simply ignored
- contain comment lines with '#' as the first non-blank character
- contain NULL entries marked with '*', 'Null', 'NULL' or 'None'

Examples

a) foo1.txt

the textbook example::

   #
   # This is an exiting table!
   #
   1 abc 5.0
   2 def 6.0
   3 ghi 7.0

b) foo2.txt

not nice, but works nevertheless::

   #
   # This is an exiting table!
   
   #
   1 abc 5.0
   #2 def 6.0
   
   3 ghi 7.0

c) foo3.txt

this can cause problems, because the integer in the last row is wider than the integers in the rows above::

   #
   # This is an exiting table!
   #
   1 abc 5.0
   2 def 6.0
   30 ghi 7.0

d) foo4.txt

impossible, because the last column mixes an integer with float values::

   #
   # This is an exiting table!
   #
   1 abc   5
   2 def 6.0
   3 ghi 7.0
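
The input rules above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the module's implementation: the function name `parse_ascii_table`, the splitting on whitespace and the use of `None` for NULL entries are hypothetical choices.

```python
# Hypothetical sketch of the input rules described above; not the asciidata code.
NULL_MARKERS = {"*", "Null", "NULL", "None"}  # NULL markers listed in the text


def parse_ascii_table(text):
    """Split table text into rows of string tokens, applying the rules above."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines are ignored.
        if not stripped:
            continue
        # Lines whose first non-blank character is '#' are comments.
        if stripped.startswith("#"):
            continue
        # Assumption: columns are separated by whitespace.
        tokens = stripped.split()
        rows.append([None if tok in NULL_MARKERS else tok for tok in tokens])
    return rows


# The content of foo2.txt from example b)
foo2 = """#
# This is an exiting table!

#
1 abc 5.0
#2 def 6.0

3 ghi 7.0
"""

rows = parse_ascii_table(foo2)
print(rows)
```

Only the two uncommented data rows survive. Type detection is deliberately left out here; every token is still a string.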

4. Module Functions

The number of functions in the module should be minimal. In fact, the only function which is immediately needed is the function 'open(*args)'. Later on there may be requirements for other functions.

open(filename=None, mode):

   Input
       filename - the filename of the ASCII table
       mode     - the access mode of the ASCII table

   Return
       asciitable - an instance of the AsciiData class

   Description
       This function is the main and only function to create a new AsciiData
       class instance. The function can be kept quite slim; a thin frontend
       to the class constructor is enough.

5. Module Classes

5.1 AsciiData

This is the central class of the module, and so far the only public class. The user will mostly use public methods of this class to work with ASCII tables.

5.1.1 Class Design

It is rather straightforward to design the AsciiData-class such that there exists a "private" class to represent the table columns. Such a column class consists of the actual data as well as the description of the column (or data), which means the data type, format and so on.

Storing the data columnwise means the AsciiData class is 'just' a collection of column instances, and its methods administrate the column collection. With respect to data access and data modification the AsciiData class is then "only" a frontend to the respective methods of the column class.
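
A minimal sketch of this column-wise design is given below. The class and method names follow the text, but the method bodies, the constructor arguments and the simplified `get_element(colnum, rownum)` signature are assumptions made for illustration only.

```python
# Hypothetical sketch: names follow the text, the bodies are assumptions.
class AsciiColumn:
    """One table column: the data plus its description."""

    def __init__(self, name, dtype, data):
        self.name = name      # column name
        self.dtype = dtype    # Python type of the elements, e.g. int or float
        self.data = list(data)

    def _get_element(self, rownum):
        return self.data[rownum]

    def _set_element(self, rownum, value):
        # Convert on the way in so the column keeps a single type.
        self.data[rownum] = self.dtype(value)


class AsciiData:
    """The table is 'just' an ordered collection of columns."""

    def __init__(self, columns):
        self._columns = list(columns)

    def get_ncols(self):
        return len(self._columns)

    def get_nrows(self):
        return len(self._columns[0].data) if self._columns else 0

    def get_element(self, colnum, rownum):
        # Element access is delegated to the column object.
        return self._columns[colnum]._get_element(rownum)

    def set_element(self, colnum, rownum, value):
        self._columns[colnum]._set_element(rownum, value)


table = AsciiData([
    AsciiColumn("id", int, [1, 2, 3]),
    AsciiColumn("flux", float, [5.0, 6.0, 7.0]),
])
table.set_element(1, 0, "5.5")   # stored as the float 5.5
print(table.get_nrows(), table.get_ncols(), table.get_element(1, 0))
```

The table class holds no data of its own; every read and write ends up in a method of the column class.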

5.1.2 Class Methods

__init__(self, filename=None, mode)

This method reads in an ASCII table from the file and initializes the instance.

save(self)

The method saves the asciidata object back to the file it was read from. Any changes made to the instance (deleted/modified elements) after the initialization will be preserved, of course.

saveas(self, filename=None, overwrite=False)

The method writes the instance to a file whose name differs from that of the file it was read from.

get_nrows(self)

The method returns the number of rows in the object instance.

get_ncols(self)

The method returns the number of columns in the object instance.

get_element(self, colname, colnum, rownum)

The method returns the specified element from the instance.

set_element(self, colname, colnum, rownum, value)

The method sets the specified element of the instance to a given value.

get_column(self, colname, colnum)

The method returns all entries of the specified column as a coherent structure (e.g. numarray object or python list).

get_row(self, rownum)

The method returns all entries of the specified row as a coherent structure (e.g. numarray object or python list).

del_column(self, colname, colnum)

The method deletes the specified column.

del_row(self, rownum)

The method deletes the specified row.

read_column(self, colname, colnum)

The method prints the specified column onto the screen.

read_row(self, rownum)

The method prints the specified row onto the screen.

create_column(self, *specs)

The method creates a new, empty column for the current instance. Which parameters are needed remains to be seen.

5.2 AsciiColumn

5.2.1 Class Design

Two important items characterize the column class:

- the data plus its container to store it
- the column information such as data type, format and so on

For the data storage there exist several possible containers:

- generic python list objects ([1,2,3,4...]), either in general as string type or as different types (float, string, boolean)
- numarray objects

python list

pros:

   extremely flexible container, which also is easy to handle

cons:

    the absence of native types requires a lot of discipline
    not to mess up the types. Because of the absence of types,
    there are also no native checks on element operations.
    This container is probably also bad in terms of performance
    and RAM use. The native Python float format is limited
    in accuracy.

numarray object

pros:

      They would be used for what they were made for, which is holding
      data of various types. This means that they should be
      fast, economical in RAM consumption and come with a lot of intrinsic
      type checking. There exists a wide range of predefined
      types.

cons:

      They have an immutable, fixed size chosen when
      they are created. Enlarging means copying to a larger
      instance. There is also no NULL entry; NULL entries definitely
      exist in ASCII tables and must be represented
      in the AsciiData class.
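
One conceivable way around the missing NULL entry is to keep a separate boolean mask next to the typed array. The sketch below uses `array.array` from the Python standard library merely as a stand-in for a numarray object; the class name `MaskedColumn` and its methods are hypothetical and not part of the design.

```python
from array import array


class MaskedColumn:
    """Typed storage plus a parallel mask marking NULL entries (hypothetical)."""

    def __init__(self, typecode, values):
        # NULL entries (None) get a placeholder value in the typed array.
        self._data = array(typecode, [0 if v is None else v for v in values])
        self._is_null = [v is None for v in values]

    def get(self, rownum):
        return None if self._is_null[rownum] else self._data[rownum]

    def set(self, rownum, value):
        # The typed array rejects values of the wrong type before the mask changes.
        self._data[rownum] = 0 if value is None else value
        self._is_null[rownum] = value is None

    def valid_values(self):
        """Return only the non-NULL entries, e.g. for statistics."""
        return [v for v, null in zip(self._data, self._is_null) if not null]


col = MaskedColumn("d", [5.0, None, 7.0])   # 'd' = double precision float
print(col.get(1), col.valid_values())
```

The typed array keeps its intrinsic type checking, while the mask alone decides whether an entry counts as NULL.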

5.2.2 Class Methods

All of the class methods are private and accessed only through the interfaces of the main class.

__init__(self, header_info, data)

The initializer of the class. What exactly the parameters "header_info" and "data" will contain remains to be seen.

_get_element(self, rownum)

The method returns the specified element.

_set_element(self, rownum, value)

The method sets the specified element to the given value.

_print_element(self, rownum)

The method returns a well-formatted string representation of the specified element.

_print_all(self)

The method returns a well-formatted string representation of all elements in the column.

6. User Examples

6.1 Example 1

Problem: The user wants to plot the colour-magnitude distribution in (F606W - F850LP) vs. F850LP for all sources in the HUDF catalogue.
Solution:
- the HUDF catalogue is read in, an AsciiData instance is created
- create a new column for the colour data
- go over each row, compute and store the colour values
- extract the data from the F850LP- and the colour-column
- send the data to your favourite plotting program
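
The steps above could look roughly as follows. Because the class interface is still open, the sketch uses a plain dictionary of Python lists as a stand-in for an AsciiData instance; the magnitude values are made up for illustration and the plotting step is left out.

```python
# Stand-in for the table: column name -> list of values (made-up magnitudes).
table = {
    "F606W": [25.5, 26.0, 27.25],
    "F850LP": [25.0, 25.0, 26.0],
}

# Create a new column for the colour data and fill it row by row.
table["colour"] = []
for m606, m850 in zip(table["F606W"], table["F850LP"]):
    table["colour"].append(m606 - m850)

# Extract the two columns that go to the plotting program.
x = table["F850LP"]
y = table["colour"]
print(list(zip(x, y)))
```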

6.2 Example 2

Problem: The user wants to create a subcatalogue from the main HUDF catalogue. All entries in the subcatalogue must satisfy certain selection criteria (e.g. various colour criteria).
Solution:
- the HUDF catalogue is read in, an AsciiData instance is created
- go over each row and check whether the data in the row satisfy your criteria
- if the criteria are NOT met, delete the row
- save the remaining entries to a different file
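
A sketch of this selection, again with a plain list of rows standing in for the AsciiData instance. The data values and the criterion in `is_selected` are invented for illustration, and writing the result to a file is omitted.

```python
# Stand-in for the table: a list of rows (id, F606W, F850LP); values are made up.
rows = [
    [1, 25.5, 25.0],
    [2, 26.0, 25.0],
    [3, 27.25, 26.0],
]


def is_selected(row):
    """Hypothetical criterion: keep sources with F606W - F850LP >= 1.0."""
    return row[1] - row[2] >= 1.0


# Walk backwards so that deleting a row does not shift the rows still to visit.
for rownum in range(len(rows) - 1, -1, -1):
    if not is_selected(rows[rownum]):
        del rows[rownum]

# Render the remaining rows as text lines; these would go to the new file.
lines = [" ".join(str(value) for value in row) for row in rows]
print("\n".join(lines))
```

Deleting while iterating forwards would skip the row that moves into the freed position, which is why the loop runs from the last row to the first.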

6.3 Example 3

Problem: The user wants to find the average and the standard deviation of a column or of a function of column values.
Solution:
- Read in the ASCII table
- go over each row
- read the value in the column of interest
- compute the mean value
- go again over each row
- compute the standard deviation
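
The two passes described above can be sketched as follows. The column is a plain Python list with made-up values, `None` stands for a NULL entry, and the choice of the population form of the standard deviation is an assumption, since the text does not say which form is meant.

```python
import math

# Stand-in for one table column; None marks a NULL entry (made-up values).
column = [2.0, 4.0, None, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

values = [v for v in column if v is not None]

# First pass over the rows: mean value.
mean = sum(values) / len(values)

# Second pass: standard deviation around that mean.
# Assumption: population form (divide by N); use N - 1 for the sample form.
variance = sum((v - mean) ** 2 for v in values) / len(values)
std = math.sqrt(variance)

print(mean, std)
```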

7. Outlook

The following topics concern items which can be neglected in the first release or will perhaps never materialize in the project. Nevertheless, the discussion of whether those items are important can start now, and their later implementation can already be taken into account during the current design phase.

7.1 Subclasses for Particular Ascii Data

There exist well-known ASCII tables used in astronomy. The textbook example is the SExtractor table, where information on the column formats and the column types is in the header. We might want to derive subclasses from the base class AsciiData which fully take into account the peculiarities of those ASCII tables.

A good starting point is perhaps to collect which special ascii tables do exist. After having an overview, it would be easier to decide whether this is a direction to follow.

7.2 Special Ascii Format for Output Data

Whenever the module loads an ASCII table of unknown type, an important task will be to find out the length of the columns, which data type they contain, which format the data has and so on.

We could introduce a special format for ASCII tables in which all the available information is stored in the header. If the module encountered such an ASCII table, the header information could be evaluated and the initial checking would be avoided.
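
For tables without such a header, the initial checking might look roughly like the following sketch. The helper names `guess_column_type` and `column_width` are hypothetical, and the rule "int, then float, then string" is an assumption about how the type of a column could be determined.

```python
def guess_column_type(tokens):
    """Return int, float or str: the narrowest type that fits every token.

    Hypothetical helper; NULL entries are expected to be None and are skipped.
    """
    for candidate in (int, float):
        try:
            for tok in tokens:
                if tok is not None:
                    candidate(tok)
        except ValueError:
            continue  # at least one token does not fit, try the next type
        return candidate
    return str


def column_width(tokens):
    """Longest token in the column, i.e. the width needed to print it."""
    return max(len(tok) for tok in tokens if tok is not None)


print(guess_column_type(["1", "2", "30"]))      # integers only
print(guess_column_type(["5", "6.0", None]))    # mixed int/float -> float
print(guess_column_type(["abc", "def"]))        # anything else -> str
print(column_width(["1", "2", "30"]))
```

Python's own `int()` and `float()` conversions are more permissive than a table format may want (for instance, `float()` also accepts "nan" and "inf"), so a real implementation would probably need stricter rules.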

7.3 Reserved Columns for Selection and Sorting

A nice item would be to introduce sorting or selection functions to the instances. This could nicely be implemented by introducing a hidden selection and sorting column.

The selection column contains a boolean that states whether the row (or column) is selected or not. Methods such as "save()" or "print()" could then consider only the entries with the selection set. The sorting column would contain the sorted row order. This would make it possible to address elements in sorted order without reshuffling the whole data for each sorting process.
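
A small sketch of the two hidden columns, using plain Python lists and made-up magnitudes. The variable and function names are hypothetical; the point is only that the data column itself is never reordered.

```python
# Made-up data: one visible column plus the two hidden helper columns.
magnitude = [26.0, 24.5, 27.25, 25.0]

# Hidden selection column: one boolean per row.
selected = [m < 27.0 for m in magnitude]

# Hidden sorting column: row numbers in sorted order; the data stay in place.
sort_order = sorted(range(len(magnitude)), key=lambda rownum: magnitude[rownum])


def sorted_selected_rows():
    """Yield (rownum, value) in sorted order, skipping unselected rows."""
    for rownum in sort_order:
        if selected[rownum]:
            yield rownum, magnitude[rownum]


print(sort_order)
print(list(sorted_selected_rows()))
```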

From perrygreenfield Fri Feb 25 11:32:14 -0600 2005 From: perrygreenfield Date: Fri, 25 Feb 2005 11:32:14 -0600 Subject: options to consider Message-ID: <20050225113214-0600@www.scipy.org>

Regarding acceptable formats for ascii tables:

I think some more specifics about what are and aren't acceptable formats should be outlined, and some thought given to perhaps later expanding what is acceptable. Some of the examples of what I'm mentioning don't need to be supported right away (indeed, I think it is far more important to get more minimal capability out there early than to try to solve all problems up front and greatly delay having anything usable) and can be added later.

- More flexibility on what constitutes lines to be ignored. (An option to the constructor, i.e., the open function, could be a regular expression that is used to determine if a line is to be ignored, with some predefined constants representing some standard cases).
- Since the design document refers to column names (it wasn't clear to me where these come from), it would be nice to eventually provide a means of specifying how to pick these up from an ascii table too (e.g., a regular expression for identifying a column name row with some standard cases built in).
- I can almost guarantee that people will want more flexibility in how entries are listed, in particular with regard to delimiters (by column position, i.e., fixed format, or commas, or spaces, etc). This could also be specifiable in the open statement.
- Numarray's constructors do allow using mixed types for constructing an array. Perhaps some future consideration can be given to using this as a means of relaxing the restriction on all column elements having to have the same numeric type (I imagine cases will arise when one sees a mix of integers and floats in one column).
- Some flexibility as to what constitutes a NULL value will probably be needed. (list of values per type?)

Regarding the public interface for the table object:

- It would be crazy not to make use of the getitem and setitem machinery that Python provides rather than users using get_row and get_column methods. Using each one in isolation is very simple, it's the combination of the two that introduces some confusion. This issue has been discussed at some length for recarray and it probably makes sense to review the consensus (or the best that could be agreed on) that was developed for it (though we haven't had time yet to implement the new features agreed on). Since they are doing many of the same things, it would seem sensible to share the same interface for common capabilities. Look the following links over as well as the existing interface to see what you think.
- Likewise, it may be nice to use the same file writing methods that PyFITS uses so that users have some consistency (e.g., writeto and flush instead of saveas and save, though those may not have been the best names). Aliasing, i.e., providing more than one name, is also a consideration, but since you are starting from scratch, that probably should be avoided.
- I'm not sure I understand why the arguments to the get_element and related methods specify both column name and number. It would seem that only one of these is needed.
- If dynamically resizing tables is to be part of the interface, should there not be an insert_row method?
- I'd be very surprised if most people would not prefer to get an array back when asking for a column. (how to deal with resizing I'll address below)

Regarding the underlying data structures. If one wanted this class to primarily act as a means of reading tables into array variables, it would be fairly simple. But if one wants to manipulate the table as an entity rather than independent arrays, then, as you note, things get more complex. Recarray does handle many of these issues. The only one it doesn't handle is dynamic changes to the structure of the table itself. But that doesn't mean it couldn't be used as a basis for such a class. Just like Python lists, what really goes on behind the scenes doesn't have to match what people think it does. Python lists really use arrays as I understand, and when more space is needed, it reallocates the array to one twice as big (or some factor) and copies the object references. No reason the same approach can't be done here. Just make the underlying recarray bigger than needed. That makes adding rows and columns low cost until the size limit is reached, then one just builds a larger table with more padding. Insertions are always more expensive since they mean moving blocks of data around. Something to consider.

Regarding NULL values. Use of arrays will require mask arrays to handle null values. Numarray has masked arrays, or one could just have the table have associated mask columns. For floating point, NaNs could be used, but unfortunately these cannot be used for ints or strings. If you want to use arrays for columns, this issue must be addressed. If you want to use lists, then the lack of arithmetic capability is likely to be an annoyance for users (they will have to explicitly convert their lists to arrays). It's hard to have it both ways on this (in any environment).

From davidabreu Fri May 27 06:20:35 -0500 2005 From: davidabreu Date: Fri, 27 May 2005 06:20:35 -0500 Subject: SExtractor ascii files Message-ID: <20050527062035-0500@www.scipy.org>

I have some code to read SExtractor-like catalogs into dictionaries and then access the columns by name. I don't know if this is the proper site to talk about this or if there is a developers forum where I can contribute.

From laidler Fri Jun 3 15:22:14 -0500 2005 From: laidler Date: Fri, 03 Jun 2005 15:22:14 -0500 Subject: SExtractor ascii files Message-ID: <20050603152214-0500@www.scipy.org> In-reply-to: <20050527062035-0500@www.scipy.org>

I think this is a good place to talk about it, at least for now. This sounds extremely useful to me. What format are the columns stored in? (Lists, numarrays, something else?) Can you post a snippet of sample code demonstrating its use?

From davidabreu Mon Jun 6 09:41:43 -0500 2005 From: davidabreu Date: Mon, 06 Jun 2005 09:41:43 -0500 Subject: SExtractor ascii files Message-ID: <20050606094143-0500@www.scipy.org> In-reply-to: <20050603152214-0500@www.scipy.org>

The code reads the columns into float numarrays if the values are floats, into integer numarrays if they are integers, and into lists if the values are strings. I want to write a class, but for now I only have one function to read and several to write in different formats (html, latex, ...).

From jhatchell Sat Sep 17 10:26:56 -0500 2005 From: jhatchell Date: Sat, 17 Sep 2005 10:26:56 -0500 Subject: output as Python dictionaries Message-ID: <20050917102656-0500@www.scipy.org>

There are often times where the output you want from an ascii file is a set of dictionaries indexed on one of the columns, e.g. RA(source) indexed on sourcename; in fact I have a class Cat to do this. This isn't trivial to reproduce from a set of numarray arrays. It would be nice to have a method which returned dictionaries given the column in which to find the keys and a set of column names.

From davidabreu Mon Sep 26 03:16:17 -0500 2005 From: davidabreu Date: Mon, 26 Sep 2005 03:16:17 -0500 Subject: output as Python dictionaries Message-ID: <20050926031617-0500@www.scipy.org> In-reply-to: <20050917102656-0500@www.scipy.org>

sexcat.sourceforge.net

From davidabreu Mon Sep 26 03:34:36 -0500 2005 From: davidabreu Date: Mon, 26 Sep 2005 03:34:36 -0500 Subject: First release sexcat project Message-ID: <20050926033436-0500@www.scipy.org>

http://sexcat.sourceforge.net/

SExCaT: SExtractor Catalogues Tool

From harry_ferguson Wed Nov 16 20:34:31 -0600 2005 From: harry_ferguson Date: Wed, 16 Nov 2005 20:34:31 -0600 Subject: Make simple things simple Message-ID: <20051116203431-0600@www.scipy.org>

Whatever asciidata does, it ought to be blindingly simple to do the obvious: read in column number N from a white-space delimited table.

I have a tool that I wrote a couple years ago that I have found to be very useful and reasonably bullet-proof. To read columns 1,3 and 5 into x,y, and z from file 'foo.dat', only one line is required:

x,y,z = fgetcols('foo.dat',1,3,5)

The returned values are numarrays of Int32 or Float64 or lists. The program decides which and is clever enough that it is hard (but not impossible, I think) to fool it.

As Perry mentions, most tables are small, so the current version of this program does some type checking on every row...meaning that it has no problem with the "impossible" example 4 above (although as written would not like the file in example 3).

Anyway...I would advocate that the module have some "one-line" function that will open the file, read selected columns into numarrays and lists, and close it back up.