Ads 468x60px

Thursday, 2 July 2015

Typed pointers and casting

Typed pointers and casting


In many languages, pointers have the additional restriction that the object they point to has a specific type. For example, a pointer may be declared to point to an integer; the language will then attempt to prevent the programmer from pointing it to objects which are not integers, such as floating-point numbers, eliminating some errors.
For example, in C
int *money;
char *bags;
money would be an integer pointer and bags would be a char pointer. The following would yield a compiler warning of "assignment from incompatible pointer type" under GCC
bags = money;
because money and bags were declared with different types. To suppress the compiler warning, it must be made explicit that you do indeed wish to make the assignment by typecasting it
bags = (char *)money;
which says to cast the integer pointer of money to a char pointer and assign to bags.
A 2005 draft of the C standard requires that casting a pointer derived from one type to one of another type should maintain the alignment correctness for both types (6.3.2.3 Pointers, par. 7):
char *external_buffer = "abcdef";
int *internal_data;

internal_data = (int *)external_buffer;  // UNDEFINED BEHAVIOUR if "the resulting pointer
                                         // is not correctly aligned"
In languages that allow pointer arithmetic, arithmetic on pointers takes into account the size of the type. For example, adding an integer number to a pointer produces another pointer that points to an address that is higher by that number times the size of the type. This allows us to easily compute the address of elements of an array of a given type, as was shown in the C arrays example above. When a pointer of one type is cast to another type of a different size, the programmer should expect that pointer arithmetic will be calculated differently. In C, for example, if the money array starts at 0x2000 and sizeof(int) is 4 bytes whereas sizeof(char) is 1 bytes, then (money+1) will point to 0x2004 but(bags+1) will point to 0x2001. Other risks of casting include loss of data when "wide" data is written to "narrow" locations (e.g. bags[0]=65537;), unexpected results when bit-shifting values, and comparison problems, especially with signed vs unsigned values.
Although it is impossible in general to determine at compile-time which casts are safe, some languages store run-time type information which can be used to confirm that these dangerous casts are valid at runtime. Other languages merely accept a conservative approximation of safe casts, or none at all.

Making pointers safer

As a pointer allows a program to attempt to access an object that may not be defined, pointers can be the origin of a variety of programming errors. However, the usefulness of pointers is so great that it can be difficult to perform programming tasks without them. Consequently, many languages have created constructs designed to provide some of the useful features of pointers without some of their pitfalls, also sometimes referred to as pointer hazards. In this context, pointers that directly address memory (as used in this article) are referred to as raw pointers, by contrast with smart pointers or other variants.
One major problem with pointers is that as long as they can be directly manipulated as a number, they can be made to point to unused addresses or to data which is being used for other purposes. Many languages, including most functional programming languages and recent imperative languages like Java, replace pointers with a more opaque type of reference, typically referred to as simply a reference, which can only be used to refer to objects and not manipulated as numbers, preventing this type of error. Array indexing is handled as a special case.
A pointer which does not have any address assigned to it is called a wild pointer. Any attempt to use such uninitialized pointers can cause unexpected behavior, either because the initial value is not a valid address, or because using it may damage other parts of the program. The result is often a segmentation fault, storage violation or wild branch (if used as a function pointer or branch address).
In systems with explicit memory allocation, it is possible to create a dangling pointer by deallocating the memory region it points into. This type of pointer is dangerous and subtle because a deallocated memory region may contain the same data as it did before it was deallocated but may be then reallocated and overwritten by unrelated code, unknown to the earlier code. Languages with garbage collection prevent this type of error because deallocation is performed automatically when there are no more references in scope.
Some languages, like C++, support smart pointers, which use a simple form of reference counting to help track allocation of dynamic memory in addition to acting as a reference. In the absence of reference cycles, where an object refers to itself indirectly through a sequence of smart pointers, these eliminate the possibility of dangling pointers and memory leaks.Delphi strings support reference counting natively.

Null pointer

Main article: Null pointer
null pointer has a value reserved for indicating that the pointer does not refer to a valid object. Null pointers are routinely used to represent conditions such as the end of a list of unknown length or the failure to perform some action; this use of null pointers can be compared to nullable types and to the Nothing value in an option type.

Autorelative pointer

The term autorelative pointer may refer to a pointer whose value is interpreted as an offset from the address of the pointer itself; thus, if a data structure, M, has an autorelative pointer member, p, that points to some portion of M itself, then Mmay be relocated in memory without having to update the value of p.
The cited patent also uses the term self-relative pointer to mean the same thing. However, the meaning of that term has been used in other ways:
  • It is often used to mean an offset from the address of a structure rather than from the address of the pointer itself.
  • It has been used to mean a pointer containing its own address, which can be useful for reconstructing in any arbitrary region of memory a collection of data structures that point to each other.

Based pointer

based pointer is a pointer whose value is an offset from the value of another pointer. This can be used to store and load blocks of data, assigning the address of the beginning of the block to the base pointer.

Multiple indirection

In some languages, a pointer can reference another pointer, requiring multiple dereference operations to get to the original value. While each level of indirection may add a performance cost, it is sometimes necessary in order to provide correct behavior for complex data structures. For example, in C it is typical to define a linked list in terms of an element that contains a pointer to the next element of the list:
struct element
{
    struct element * next;
    int              value;
};

struct element * head = NULL;
This implementation uses a pointer to the first element in the list as a surrogate for the entire list. If a new value is added to the beginning of the list, head has to be changed to point to the new element. Since C arguments are always passed by value, using double indirection allows the insertion to be implemented correctly, and has the desirable side-effect of eliminating special case code to deal with insertions at the front of the list:
// Given a sorted list at *head, insert the element item at the first
// location where all earlier elements have lesser or equal value.
void insert(struct element **head, struct element *item)
{
    struct element ** p;  // p points to a pointer to an element

    for (p = head; *p != NULL; p = &(*p)->next)
    {
        if (item->value <= (*p)->value)
            break;
    }
    item->next = *p;
    *p = item;
}

// Caller does this:
insert(&head, item);
In this case, if the value of item is less than that of head, the caller's head is properly updated to the address of the new item.
A basic example is in the argv argument to the main function in C (and C++), which is given in the prototype as char **argv – this is because the variable argv itself is a pointer to an array of strings (an array of arrays), so *argv is a pointer to the 0th string (by convention the name of the program), and **argv is the 0th character of the 0th string.

Function pointer

In some languages, a pointer can reference executable code, i.e., it can point to a function, method, or procedure. Afunction pointer will store the address of a function to be invoked. While this facility can be used to call functions dynamically, it is often a favorite technique of virus and other malicious software writers.
    int  a, b, x, y;
    int  sum(int n1, int n2);   // Function with two integer parameters returning an integer value
    int  (*fp)(int, int);       // Function pointer which can point to a function like sum

    fp = &sum;                  // fp now points to function sum
    x = (*fp)(a, b);            // Calls function sum with arguments a and b
    y = sum(a, b);              // Calls function sum with arguments a and b

Dangling pointer

Main article: Dangling pointer
dangling pointer is a pointer that has not been initialized (that is, a wild pointer has not had any address assigned to it) and may make a program crash or behave oddly. In the Pascal or C programming languages, pointers that are not specifically initialized may point to unpredictable addresses in memory.
The following example code shows a dangling pointer:
int func(void)
{
    char *p1 = malloc(sizeof(char)); /* (undefined) value of some place on the heap */
    char *p2;       /* wild (uninitialized) pointer */
    *p1 = 'a';      /* This is OK, assuming malloc() has not returned NULL. */
    *p2 = 'b';      /* This invokes undefined behavior */
}
Here, p2 may point to anywhere in memory, so performing the assignment *p2 = 'b' can corrupt an unknown area of memory or trigger a segmentation fault.

Wild branch

Where a pointer is used as the address of the entry point to a program or start of a subroutine and is also either uninitialized or corrupted, if a call or jump is nevertheless made to this address, a "wild branch" is said to have occurred. The consequences are usually unpredictable and the error may present itself in several different ways depending upon whether or not the pointer is a "valid" address and whether or not there is (coincidentally) a valid instruction (opcode) at that address. The detection of a wild branch can present one of the most difficult and frustrating debugging exercises since much of the evidence may already have been destroyed beforehand or by execution of one or more inappropriate instructions at the branch location. If available, an instruction set simulator can usually not only detect a wild branch before it takes effect, but also provide a complete or partial trace of its history.

Simulation using an array index

It is possible to simulate pointer behavior using an index to an (normally one-dimensional) array.
Primarily for languages which do not support pointers explicitly but do support arrays, the array can be thought of and processed as if it were the entire memory range (within the scope of the particular array) and any index to it can be thought of as equivalent to a general purpose register in assembly language (that points to the individual bytes but whose actual value is relative to the start of the array, not its absolute address in memory). Assuming the array is, say, a contiguous 16megabyte character data structure, individual bytes (or a string of contiguous bytes within the array) can be directly addressed and manipulated using the name of the array with a 31 bit unsigned integer as the simulated pointer (this is quite similar to the C arrays example shown above). Pointer arithmetic can be simulated by adding or subtracting from the index, with minimal additional overhead compared to genuine pointer arithmetic.
It is even theoretically possible, using the above technique, together with a suitable instruction set simulator to simulate anymachine code or the intermediate (byte code) of any processor/language in another language that does not support pointers at all (for example Java / JavaScript). To achieve this, the binary code can initially be loaded into contiguous bytes of the array for the simulator to "read", interpret and action entirely within the memory contained of the same array. If necessary, to completely avoid buffer overflow problems, bounds checking can usually be actioned for the compiler (or if not, hand coded in the simulator).

Support in various programming languages

Ada

Ada is a strongly typed language where all pointers are typed and only safe type conversions are permitted. All pointers are by default initialized to null, and any attempt to access data through a null pointer causes an exception to be raised. Pointers in Ada are called access types. Ada 83 did not permit arithmetic on access types (although many compiler vendors provided for it as a non-standard feature), but Ada 95 supports “safe” arithmetic on access types via the packageSystem.Storage_Elements.

BASIC

Several old versions of BASIC for the Windows platform had support for STRPTR() to return the address of a string, and for VARPTR() to return the address of a variable. Visual Basic 5 also had support for OBJPTR() to return the address of an object interface, and for an ADDRESSOF operator to return the address of a function. The types of all of these are integers, but their values are equivalent to those held by pointer types.
Newer dialects of BASIC, such as FreeBASIC or BlitzMax, have exhaustive pointer implementations, however. In FreeBASIC, arithmetic on ANY pointers (equivalent to C's void*) are treated as though the ANY pointer was a byte width. ANY pointers cannot be dereferenced, as in C. Also, casting between ANY and any other type's pointers will not generate any warnings.
dim as integer f = 257
dim as any ptr g = @f
dim as integer ptr i = g
assert(*i = 257)
assert( (g + 4) = (@f + 1) )

C and C++

In C and C++ pointers are variables that store addresses and can be null. Each pointer has a type it points to, but one can freely cast between pointer types (but not between a function pointer and non-function pointer type). A special pointer type called the “void pointer” allows pointing to any (non-function) variable type, but is limited by the fact that it cannot be dereferenced directly. The address itself can often be directly manipulated by casting a pointer to and from an integral type of sufficient size, though the results are implementation-defined and may indeed cause undefined behavior; while earlier C standards did not have an integral type that was guaranteed to be large enough, C99 specifies the uintptr_t typedef name defined in <stdint.h>, but an implementation need not provide it.
C++ fully supports C pointers and C typecasting. It also supports a new group of typecasting operators to help catch some unintended dangerous casts at compile-time. Since C++11, the C++ standard library also provides smart pointers(unique_ptrshared_ptr and weak_ptr) which can be used in some situations as a safe alternative to primitive C pointers. C++ also supports another form of reference, quite different from a pointer, called simply a reference or reference type.
Pointer arithmetic, that is, the ability to modify a pointer's target address with arithmetic operations (as well as magnitude comparisons), is restricted by the language standard to remain within the bounds of a single array object (or just after it), and will otherwise invoke undefined behavior. Adding or subtracting from a pointer moves it by a multiple of the size of thedatatype it points to. For example, adding 1 to a pointer to 4-byte integer values will increment the pointer by 4. This has the effect of incrementing the pointer to point at the next element in a contiguous array of integers—which is often the intended result. Pointer arithmetic cannot be performed on void pointers because the void type has no size, and thus the pointed address can not be added to, although gcc and other compilers will perform byte arithmetic on void* as a non-standard extension. For working "directly" with bytes they usually cast pointers to BYTE*, or unsigned char* if BYTE is not defined in the standard library used.
Pointer arithmetic provides the programmer with a single way of dealing with different types: adding and subtracting the number of elements required instead of the actual offset in bytes. (though the char pointer, char being defined as always having a size of one byte, allows the element offset of pointer arithmetic to in practice be equal to a byte offset) In particular, the C definition explicitly declares that the syntax a[n], which is the n-th element of the array a, is equivalent to *(a+n), which is the content of the element pointed by a+n. This implies that n[a] is equivalent to a[n], and one can write, e.g., a[3] or 3[a] equally well to access the fourth element of an array a.
While powerful, pointer arithmetic can be a source of computer bugs. It tends to confuse novice programmers, forcing them into different contexts: an expression can be an ordinary arithmetic one or a pointer arithmetic one, and sometimes it is easy to mistake one for the other. In response to this, many modern high-level computer languages (for example Java) do not permit direct access to memory using addresses. Also, the safe C dialect Cyclone addresses many of the issues with pointers. See C programming language for more discussion.
The void pointer, or void*, is supported in ANSI C and C++ as a generic pointer type. A pointer to void can store an address to any non-function data type, and, in C, is implicitly converted to any other pointer type on assignment, but it must be explicitly cast if dereferenced inline. K&R C used char* for the “type-agnostic pointer” purpose (before ANSI C).
int x = 4;
void* q = &x;
int* p = q;  /* void* implicitly converted to int*: valid C, but not C++ */
int i = *p;
int j = *(int*)q; /* when dereferencing inline, there is no implicit conversion */
C++ does not allow the implicit conversion of void* to other pointer types, even in assignments. This was a design decision to avoid careless and even unintended casts, though most compilers only output warnings, not errors, when encountering other ill casts.
int x = 4;
void* q = &x;
// int* p = q; This fails in C++: there is no implicit conversion from void*
int* a = (int*)q; // C-style cast
int* b = static_cast<int*>(q); // C++ cast
In C++, there is no void& (reference to void) to complement void* (pointer to void), because references behave like aliases to the variables they point to, and there can never be a variable whose type is void.

C#

In the C# programming language, pointers are supported only under certain conditions: any block of code including pointers must be marked with the unsafe keyword. Such blocks usually require higher security permissions than pointerless code to be allowed to run. The syntax is essentially the same as in C++, and the address pointed can be either managed orunmanaged memory. However, pointers to managed memory (any pointer to a managed object) must be declared using the fixed keyword, which prevents the garbage collector from moving the pointed object as part of memory management while the pointer is in scope, thus keeping the pointer address valid.
An exception to this is from using the IntPtr structure, which is a safe managed equivalent to int*, and does not require unsafe code. This type is often returned when using methods from the System.Runtime.InteropServices, for example:
// Get 16 bytes of memory from the process's unmanaged memory
IntPtr pointer = System.Runtime.InteropServices.Marshal.AllocHGlobal(16);

// Do something with the allocated memory

// Free the allocated memory
System.Runtime.InteropServices.Marshal.FreeHGlobal(pointer);
The .NET framework includes many classes and methods in the System and System.Runtime.InteropServicesnamespaces (such as the Marshal class) which convert .NET types (for example, System.String) to and from manyunmanaged types and pointers (for example, LPWSTR or void *) to allow communication with unmanaged code.

COBOL

The COBOL programming language supports pointers to variables. Primitive or group (record) data objects declared within the LINKAGE SECTION of a program are inherently pointer-based, where the only memory allocated within the program is space for the address of the data item (typically a single memory word). In program source code, these data items are used just like any other WORKING-STORAGE variable, but their contents are implicitly accessed indirectly through their LINKAGEpointers.
Memory space for each pointed-to data object is typically allocated dynamically using external CALL statements or via embedded extended language constructs such as EXEC CICS or EXEC SQL statements.
Extended versions of COBOL also provide pointer variables declared with USAGE IS POINTER clauses. The values of such pointer variables are established and modified using SET and SET ADDRESS statements.
Some extended versions of COBOL also provide PROCEDURE-POINTER variables, which are capable of storing theaddresses of executable code.

PL/I

The PL/I language provides full support for pointers to all data types (including pointers to structures), recursion,multitasking, string handling, and extensive built-in functions. PL/I was quite a leap forward compared to the programming languages of its time

D

The D programming language is a derivative of C and C++ which fully supports C pointers and C typecasting.

Eiffel

The Eiffel object-oriented language employs value and reference semantics without the need for pointer arithmetics. Nevertheless, pointer classes are provided. They offer pointer arithmetics, typecasting, explicit memory management, interfacing with non-Eiffel software, and other features.

Fortran

Fortran-90 introduced a strongly typed pointer capability. Fortran pointers contain more than just a simple memory address. They also encapsulate the lower and upper bounds of array dimensions, strides (for example, to support arbitrary array sections), and other metadata. An association operator=> is used to associate a POINTER to a variable which has aTARGET attribute. The Fortran-90 ALLOCATE statement may also be used to associate a pointer to a block of memory. For example, the following code might be used to define and create a linked list structure:
type real_list_t
  real :: sample_data(100)
  type (real_list_t), pointer :: next => null ()
end type

type (real_list_t), target :: my_real_list
type (real_list_t), pointer :: real_list_temp

real_list_temp => my_real_list
do
  read (1,iostat=ioerr) real_list_temp%sample_data
  if (ioerr /= 0) exit
  allocate (real_list_temp%next)
  real_list_temp => real_list_temp%next
end do
Fortran-2003 adds support for procedure pointers. Also, as part of the C Interoperability feature, Fortran-2003 supports intrinsic functions for converting C-style pointers into Fortran pointers and back.

Go

Go has pointers. Its declaration syntax is equivalent to that of C, but written the other way around, ending with the type. Unlike C, Go has garbage collection, and disallows pointer arithmetic. Reference types, like in C++, do not exist. Some built-in types, like maps and channels, are boxed (i.e. internally they are pointers to mutable structures), and are initialized using the make function. As a different (than reference types) approach to unified syntax between pointers and non-pointers, the arrow (->) operator has been dropped—it is possible to use the dot operator directly on a pointer to a data type to access a field or method of the dereferenced value, as if the dot operator were used on the underlying data type. This, however, only works with 1 level of indirection.

Java

Unlike C, C++, or Pascal, there is no explicit representation of pointers in Java. Instead, more complex data structures likeobjects and arrays are implemented using references. The language does not provide any explicit pointer manipulation operators. It is still possible for code to attempt to dereference a null reference (null pointer), however, which results in a run-time exception being thrown. The space occupied by unreferenced memory objects is recovered automatically bygarbage collection at run-time.

Modula-2

Pointers are implemented very much as in Pascal, as are VAR parameters in procedure calls. Modula-2 is even more strongly typed than Pascal, with fewer ways to escape the type system. Some of the variants of Modula-2 (such as Modula-3) include garbage collection. For instance, when it comes to computing time, one must carry the remainder past 12.

Oberon

Much as with Modula-2, pointers are available. There are still fewer ways to evade the type system and so Oberon and its variants are still safer with respect to pointers than Modula-2 or its variants. As with Modula-3, garbage collection is a part of the language specification.

Pascal

Unlike many languages that feature pointers, standard ISO Pascal only allows pointers to reference dynamically created variables that are anonymous and does not allow them to reference standard static or local variables. It does not have pointer arithmetic. Pointers also must have an associated type and a pointer to one type is not compatible with a pointer to another type (e.g. a pointer to a char is not compatible with a pointer to an integer). This helps eliminate the type security issues inherent with other pointer implementations, particularly those used for PL/I or C. It also removes some risks caused by dangling pointers, but the ability to dynamically let go of referenced space by using the dispose standard procedure (which has the same effect as the free library function found in C) means that the risk of dangling pointers has not been entirely eliminated.
However, in some commercial and open source Pascal (or derivatives) compiler implementations —like Free Pascal,Turbo Pascal or the Object Pascal in Embarcadero Delphi— a pointer is allowed to reference standard static or local variables and can be cast from one pointer type to another. Moreover pointer arithmetic is unrestricted: adding or subtracting from a pointer moves it by that number of bytes in either direction, but using the Inc or Dec standard procedures with it moves the pointer by the size of the data type it is declared to point to. An untyped pointer is also provided under the name Pointer, which is compatible with other pointer types.

Perl

The Perl programming language supports pointers, although rarely used, in the form of the pack and unpack functions. These are intended only for simple interactions with compiled OS libraries. In all other cases, Perl uses references, which are typed and do not allow any form of pointer arithmetic. They are used to construct complex data structures.

No comments:

Post a Comment

 
Blogger Templates